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Abstract 

The  accomplishments  of  the  research  project  "High  Performance  Parallel  Computing" 
for  the  Year  1983  span  algorithm  formulation,  parallel  programming  languages,  basic 
software  for  the  Texas  Reconfigurable  Array  Computer  and  validation  of  design 
concepts  for  the  Texas  Reconfigurable  Array  Computer  (TRAC).  Image  processing, 
sorting  and  time  dependent  partial  differential  equations  were  subjects  for  algorithm 
formulation  and  analysis.  Accomplishments  in  parallel  programming  include: 
substantial  progress  toward  the  implementation  of  two  parallel  programming 
environments,  the  Computation  Structures  Language,  and  a  task  level  data  flow 
programming  system.  The  hardware  prototype  of  TRAC  made  substantial  progress 
towards  stability.  The  state-of-the-art  in  reconfigurable  switch  based  architectures  has 
been  advanced.  A  result  of  note  is  the  demonstration  of  the  integration  of  circuit 
switching  and  packet  switching  in  a  single  interconnection  network. 

1.  Research  Objectives 

This  research  project  was  an  integrated  approach  to  parallel  computation  spanning 
algorithm  formulation,  programming  and  software,  and  hardware/architecture  design 
and  prototyping.  The  goal  of  the  program  was  to  establish  a  flexible  environment  for 
experimental  studies  of  parallel  computations.  The  focus  of  algorithm  formulation  and 
software  development  research  was  the  Texas  Reconfigurable  Array  Computer  (TRAC). 
It  was  realized  during  the  course  of  this  year  that  the  concepts  being  developed  for 
programming  and  software  for  TRAC  applied  to  a  wider  range  of  architectures.  We 
began  to  develop  our  software  systems  with  a  broader  range  of  architectures  in  mind 
during  this  year. 

The  experience  of  building  a  prototype  of  the  Texas  Reconfigurable  Array  Computer 
(TRAC)  has  proved  to  be  highly  rewarding.  It  has  improved  our  understanding  of 
parallel  architectures  and,  we  hope,  has  made  significant  contributions  to  the  area  of 
parallel  processing. 

TRAC  employs  the  novel  concepts  of  reconfigurability  and  space-sharing  in  its 
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organization.  These  are  seen  to  be  the  key  to  the  success  of  the  general  purpose  tfagjhtly 
coupled  multiprocessors.  Many  other  unique  architectural  features  were  included  to 
enable  it  to  perform  equally  well  in  both  numeric  and  non-numeric  applications.  Most 
of  these  features  have  fulfilled  their  promise  while  others  have  brought  to  Right 
important  issues  which  may  demand  further  study.  The  goals  for  the  TMAC 
reconfigurable  architecture  may  be  summarized  as  follows: 


1.  Trivially,  it  must  have  an  organization  to  accommodate  a  large  number  of 


processors. 


2.  It  must  provide  for  different  modes  of  communication  between  the 
processing  units. 


3.  It  must  have  synchronization  mechanisms  general  enough  to  allow  an 
arbitrary  combination  of  processors  to  be  synchronized. 


4.  It  must  be  capable  of  SISD,  SIMD,  and  MIMD  modes  of  execution  (12).  The 
system  should  be  dynamically  reconfigurable  between  tasks  to  support  these 
modes  of  execution  and  to  maximize  the  use  of  system  resources. 


5.  Virtualize  the  computation.  The  system  must  support  vertical  migration 
capability  and  make  underlying  hardware  transparent  to  the  user.  A 
parallel  architecture  should  provide  a  basis  for  implementation  of  parallel 
languages. 


6.  Map  the  architecture  to  the  algorithm.  The  system  organization  should  be 
flexible  enough  so  as  to  be  able  to  mold  the  architecture  to  the  algorithm, 
not  the  algorithm  to  the  architecture  as  has  been  applied  in  the  past.  It 
should  make  available  to  an  algorithm  the  parallelism  that  it  requests  so 
that  its  true  performance  can  be  evaluated. 


7.  Give  attention  to  the  technology.  The  machine  should  be  built  modularly 
with  a  minimum  number  of  unique  partitions.  This  would  facilitate  the 
translation  of  the  design  into  the  emerging  VLSI  technology.  This  last  goal 
will  allow  us  to  assess  the  engineering  decisions  that  went  into  the  design  of 
TRAC  and  to  document  our  experience  with  the  TRAC  development  effort. 
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2.  Research  Accomplishments 

The  research  accomplishments  of  the  project  during  1983  spanned  the  range  from 
algorithm  formulation  to  programming  languages  to  operating  systems  and  finally  to 
validation  of  hardware/architecture  design  concepts.  The  subsections  which  follow  give 
the  principle  research  contributions  under  each  topic. 

2.1.  Algorithm  Formulation 

One  of  the  two  principal  algorithm  formulation  activities  during  this  year  was  the 
establishment  of  optimal  partitioning  schemes  for  time  dependent  partial  differential 
equations.  This  work  is  discussed  in  a  report  by  E.  J.  Shipsey  [SHI83].  A  second 
algorithm  study  was  study  of  the  relative  efficiency  of  packet  switching  and  circuit 
switching  architectures  for  realization  of  histogramming  and  smoothing  algorithms  for 
image  processing.  A  paper  on  this  research  was  published  in  the  Proceedings  of  the 
1983  International  Conference  on  Parallel  Programming  [YAS83].  The  principal 
findings  are  that  circuit  based  architectures  will  become  more  efficient  as  the  image 
resolution  and  thus  the  volume  of  data  to  be  processed  increases. 

Ramakrishnan  and  Browne  [RAM83],  in  research  performed  under  the  1982  AFOSR 
grant  of  the  same  title  but  not  reported,  developed  a  paradigm  for  the  design  of  parallel 
algorithms  for  SIMD  computer  arrays.  This  research  explores  the  class  of  algorithms 
which  can  be  created  by  combining  computational  and  data  movement  functions  in  a 
single  abstract  machine  instruction.  The  results  obtained  included  a  new  algorithm  for 
merging  on  a  bidirectional  pipeline  of  processors. 

2.2.  Parallel  Programming  Languages 

The  year  1983  saw  substantial  advance  in  the  development  of  the  run-time  support 
system  for  the  Computation  Structures  Language  (CSL).  The  design  for  the  run-time 
system  of  CSL  was  completed,  implemented  and  partially  debugged  during  this  time 
period.  Implementation  is  taking  place  on  the  Dual  Cyber  170-750’s  of  the  University 
of  Texas  Computation  Center.  A  design  for  a  task  level  data  flow  language  which  will 
complement  the  capabilities  of  CSL  was  initiated  during  1983. 
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The  two  underlying  principles  upon  which  we  are  basing  our  parallel  programming 
languages  are  the  separation  of  representation  of  dependency  relations  from 
computations  and  the  recognition  that  all  parallel  computations  can  be  expressed  as 
directional  graphs. 

The  Computation  Structures  Language  is  a  text  string  language  for  specification  of 
computation  graphs  and  for  specification  of  explicit  traversal  paths  for  the  computation 
graphs  to  execute  the  computation.  The  task  level  data  flow  language  uses  the  same 
principle  of  separate  units  of  computation  from  dependency  relations,  but  implements 
an  implicit  traversal  of  the  computation  graph. 

2.3.  Operating  System  for  TRAC 

1983  saw  the  completion  of  the  design  for  the  operating  system  for  the  Texas 
Reconfigurable  Array  Computer.  This  research  was  primarily  executed  by  Mr.  Daniel 
Canas. 

Several  unique  problems  arose  from  this  research.  The  first  of  these  problems  was  the 
integration  of  a  virtual  memory  architecture  into  the  reconfigurable  memory  structure 
of  TRAC.  The  capability  of  the  TRAC  architecture  for  switching  memory  units 
between  processor  configurations  is  a  powerful  means  of  sharing  of  memory. 

The  mode  of  moving  a  memory  unit  between  processor  configurations  is  to  generate 
an  interrupt  when  a  processor  attempts  access  to  an  address  which  is  in  a  "shared" 
memory  module  that  is  not  currently  attached  to  the  requesting  process.  The  interrupt 
service  routine  realizes  its  request  by  establishing  a  circuit  to  the  memory  board  holding 
the  requested  address.  A  virtual  memory  page  fault  can  be  created  in  the  same  manner. 

The  result  is  the  establishment  of  a  unification  of  virtual  memory  and  reconfigurable 
memory  architectures  for  TRAC.  The  techniques  developed  here  can  be  applied  to  page 
fault  handling  in  a  conventional  demand  driven  page  environment  if  paging  is  via  a 
switchable  memory  configuration. 


2.4.  Hardware  Design  Concepts 

There  have  been  quite  a  number  of  accomplishments  directly  resulting  from  work 
during  the  year  on  the  design  and  implementation  of  the  prototype.  These  will  be 
covered  subsequently.  The  memory  and  the  input/output  devices  are  connected  to  the 
base  nodes  of  the  banyan  network.  The  TRAC  architecture  distinguishes  between  the 
two  kinds  of  devices  and  has  separate  interfaces  for  both.  The  memory  modules  employ 
the  Primary  Memory  Interface  while  the  input/output  devices  use  the  Auxiliary 
Resource  Interface  (ARI).  One  significant  accomplishment  during  this  year  was  the 
definition  and  implementation  of  the  ARI  for  support  of  terminal  and  disk  I/O. 

The  ARI  access  has  allowed  I/O  programming  to  be  device-dependent  at  the  user 
level.  The  transfers  between  an  Auxiliary  Resource  (I/O  Device)  and  the  primary 
memory  were  implemented  via  descriptor  based  instructions.  The  descriptors  have  the 
same  general  format,  although  their  contents  are  specific  to  the  device  being  processed. 
The  calling  sequences  of  the  instructions  are  independent  of  the  device  being  addressed, 
making  the  hardware  details  of  the  underlying  device  transparent  to  the  user.  Thus  the 
concept  of  ARI  has  become  central  to  the  virtualization  of  I/O  in  TRAC.  (The  concept 
is  not  dissimilar  to  that  of  the  dev  file  in  UNIX).  Also,  The  actual  transfers  of  data 
between  the  device  and  the  primary  memory  are  complementary,  allowing  transfer  of 
data  during  every  memory  cycle.  The  ARI  concept  has  already  been  used  to  connect 
devices  such  as  terminal  (14),  printer,  disk,  self-managed-secondary-memory  (15),  and 
the  control  port  (14)  to  TRAC. 

A  second  major  milestone  was  the  successful  implementation  of  the  banyan 
interconnection  network.  It  can  be  considered  to  be  the  most  important  contribution  of 
the  TRAC  project  to  date.  It  is  a  two-sided,  multistage  network  with  processors  at  the 
apex  end  and  memories  or  input/output  devices  at  the  base.  It  has  been  built 
modularly  with  unique  partitioning  properties;  it  has  been  built  using  a  single  building 
block  called  the  switch  module.  The  switch  module  itself  is  easily  segmentable  and 
amenable  to  VLSI  implementation. 


The  primary  purpose  of  a  close-coupled  computer  network  is  to  provide  mechanisms 
for  processor-memory  and  processor-processor  communication  and  those  for 
interprocessor  synchronization.  The  performance  of  algorithms  is  directly  related  to  the 
effectiveness  of  these  mechanisms.  Most  of  the  unique  characteristics  of  the  TRAC 
architecture  accrue  from  the  capabilities  of  the  supporting  interconnection  network. 
The  network  supports  both  packet  and  circuit  switched  modes  of  data  movement.  The 
packets  are  essential  for  implementing  an  arbitrary  permutation  on  a  blocking  network 
while  also  furnishing  asynchronous  communication  between  the  processing  elements. 
The  packet  communication  facility  provides  a  means  for  intra-task  data  permutation, 
intertask  communication  and  operating  system  message  interface.  The  circuit  switched 
modes  of  interprocessor  communication  occur  in  TRAC  in  the  form  of  shared  and 
instruction  trees.  It  is  believed  that  the  presence  of  both  circuit  switched  and  packet 
switched  modes  of  communication  is  necessary  to  produce  the  best  performance. 

It  was  during  this  year  that  prototype  validation  of  both  modes  of  communication  was 
accomplished.  This  resulted  from  exhaustive  testing  of  the  interconnection  network. 
Programs  utilizing  both  circuit  switched  shared  memory  and  packet  transmission  for 
intra-task  communication  were  successfully  executed. 

During  this  period  the  Control  Port  was  designed  and  implemented.  It  is  the  interface 
between  an  arbitrary  TRAC  processor  and  the  Network  Controller  which  is  responsible 
for  the  generation  of  Data,  Instruction,  and  Shared  Memory  trees.  The  Control  Port 
was  tested  and  performs  at  the  design  goal  of  1  MHz  (cycle  time).  All  aspects  of  cold 
and  warm  restart  were  shown  to  be  100%  functional. 

The  processor  microcode  space  was  expanded  from  2K  to  8K  allowing  for  increased 
code  space  along  with  support  for  an  external  arithmetic  processing  unit  (APU)  with 
floating  point  operations.  This  restructure  provided  more  flexible  microbranch 
instructions  and  the  removal  of  wasteful  duplicated  microcode. 

A  number  of  tools  were  generated  in-house  to  help  develop  hardware  and  microcode 
for  TRAC,  and  were  used  in  addition  to  the  traditional  tools  such  as  the  high 
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bandwidth  oscilloscopes  and  logic  analyzers.  These  tools  were  built  around  a  Z-80  based 
Cromemco  microcomputer.  The  tools  included  monitors,  testers,  and  microprogram 
development  aids.  These  are  discussed  in  more  detail  in  the  following  paragraphs. 

Two  monitor  programs,  MOE  and  RABUG,  were  written  for  the  Cromemco  System  to 
help  develop  the  hardware  and  to  facilitate  debugging  of  the  microcode.  A  parallel 
interface  was  built  which  allows  the  Cromemco  microcomputer  to  read  and  write  to  the 
network  interface  busses  and  the  micro-address  busses  of  all  the  processor  modules. 
Through  separate  interface,  the  micro-computer  is  able  to  control  the  clock.  By  proper 
utilization  of  these  interfaces,  the  monitor  programs  running  on  the  Cromemco  can  step 
through  the  phases,  micro-cycles  and  memory  cycles,  and,  in  addition,  can  read  data 
from  or  write  data  to  the  busses  in  the  TRAC  system. 


Monitor  program  MOE  has  a  capability  to  sequence  through  a  specified  number  of 
phases,  micro-cycles,  machine  cycles,  and  TRAC  instructions,  and  display  or  print  the 
data  from  the  busses,  and  display  the  contents  of  the  memory  pointer  registers  and  the 
processor  status  registers  for  individual  processors.  To  facilitate  this,  instrumentation 
was  added  at  the  microprogram  level  to  output  the  required  information  on  the  network 
bus.  At  the  beginning  of  each  instruction,  a  microprogram  routine  is  executed  which 
supplies  the  processor  status  information  and  memory  pointer  information  to  the 
monitor  program,  which  then  displays  it  on  the  CRT  screen.  The  monitor  program  is 
also  capable  of  receiving  its  commands  from  a  batch  file.  The  batch  files  are  created  to 
run  the  machine  through  an  entire  program  and  list  bus  data  or  processor  status  any 
number  of  times.  This  way  it  is  possible  to  exercise  the  machine  for  long  periods  of 
time  and  capture  faults  if  they  occur. 

Monitor  program  MOE  mentioned  above,  explicitly  controlled  the  clock,  executing 
considerable  Z-80  code  for  each  TRAC  clock  step.  As  a  result,  the  TRAC  hardware  was 
exercised  only  at  slow  speeds.  After  all  functionalities  of  the  architecture  were 
developed  and  tested,  a  need  was  experienced  for  a  more  sophisticated  monitor  which 
would  run  the  system  at  the  rated  speeds  of  10kHz,  100kHz  or  1  MHz  and  still  retain 
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the  debugging  capability.  Coupling  logic  was  added  to  enable  the  TRAC  processors  and 
the  Cromemco  system  to  hand-shake  and  to  allow  the  monitor  program  RABUG  to 
switch  between  free  running  and  controlled  stepping  of  the  system  clock.  A  capability 
was  also  added  so  that  the  monitor  could  send  commands,  in  addition  to  data,  to  the 
memory  modules.  This  latter  capability  has  proved  to  be  helpful  for  software 
debugging,  since  via  the  monitor,  the  contents  of  the  memory  can  be  inspected  and/or 
modified.  It  is  now  also  possible  to  insert  break  points  in  the  programs  to  further 
facilitate  their  testing.  As  a  result,  RABUG  now  provides  a  multiprocessor  debug 
facility  for  both  microcode  and  TRAC  machine  code  program  validation. 

2.5.  Software  Development  Tools 

A  software  simulator  for  TRAC  was  written  to  help  develop  system  and  application 
software  while  the  hardware  was  under  construction.  The  software  simulator  is  able  to 
provide  the  parallel  programming  environment  available  on  TRAC.  It  is  also  able  to 
simulate  the  shared  tree  concept  and  the  packet  communication.  A  Pascal  compiler,  an 
assembler,  and  a  loader  have  also  been  developed  for  the  TRAC  system.  These 
programs  can  also  generate  code  that  can  be  interpreted  by  the  TRAC  simulator. 
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COMPUTATIONAL  ORGANIZATION 

FOR 

PARALLEL  COMPUTATION: 
THE  TIME  EVOLUTION  OF 
PHYSICAL  SYSTEMS 


E.  J.  Shipsey 
Department  of  Physic* 

The  University  of  Texts  tt  Austin 
Austin,  Texts  78712 

1.  Introduction 

The  state  of  t  physktl  system  is  described,  in  genera),  by  a  set  of  quantities  characterising  its  condition. 
These  physical  qualities  are  described  mathematically  as  functions  defined  on  a  physical  space  in  which 
the  system  is  embedded.  The  behavior  of  the  physical  system  as  time  advances  is  described  by  a  set  of 
partial  differential  equations.  The  state  of  a  system  is  given  when  all  the  functions  characterizing  its 
condition  are  known  for  all  points  in  space  at  that  given  time.  Thus  the  behavior  of  the  system  is  given 
by  a  set  of  functions  defined  on  some  space  which  are  continually  changing  with  time. 

A  more  general  view  may  be  abstracted  from  this  notion.  Instead  of  time,  a  more  general  propagation 
variable  may  be  considered.  If  the  partial  differential  equations  describing  the  physical  system  are  defined 
in  terms  of  N  variables,  the  functions  characterizing  the  system  can  be  viewed  in  a  N-l  dimensional  space 
as  the  remaining  independent  variable  takes  successive  values.  In  this  way  the  system  is  imagined  to 
evolve  in  an  N-l  dimensional  space  as  viewed  by  an  "observer*  travelling  along  the  remaining  coordinate 
axis.  Computationally  propagation  implies  that,  in  principle,  if  the  functions  characterizing  the  physical 
system  are  known  for  some  values  of  the  propagation  variable  they  can  be  computed  at  a  value  of  the 
propagation  variable  which  in  some  sense  is  further  along  than  the  values  for  which  the  function  is  already 
known.  Conceptually  and  operationally  the  propagation  variable  is  associated  with  the  unfolding 
development  of  the  physical  system.  It  may  be  possible  to  find  more  than  one  propagation  variable  for  a 
given  physical  system.  The  option  then  arises  of  selecting  a  geometry  which  will  give  maximum  efficiency 
or  stability  for  a  given  purpose.  An  example  of  such  a  situation  is  given  in  Section  IV.  Ideally,  the 
propagation  variable  should  be  selected  so  that  the  computation  of  the  functions  characterizing  the  system 
is  as  simple  as  possible,  aud  that  old  values  of  the  functions  are  required  at  very  few  other  values  of  the 
propagation  variable. 

The  computational  process  is  further  analyzed  by  considering  the  N-l  dimensional  subspace  (the 
computational  level)  which  results  when  the  propagation  variable  is  held  at  a  fixed  value.  The  objective  is 
to  find,  in  some  sense,  regions  of  computational  independence.  The  independence  is  of  a  restricted  nature 
because,  of  course,  the  entire  space  is  quite  intimately  involved  in  the  long  time  development  of  the 
system.  An  independence  of  soils  can  be  conceived  for  sufficiently  short  time  intervals  and  of  a  qualified 
local  nature.  This  is  accomplished  by  means  of  partitioning.  The  computational  level  is  divide!  into 
regions  which  are  called  computational  subdomains.  To  each  computational  subdomain  is  added  a  border. 
The  border  is  understood  to  contain  only  points  contained  in  an  adjacent  computational  subdomain.  The 
region  of  the  computational  level  consisting  of  the  computational  subdomain  plus  its  associated  border  is 
spoken  of  as  a  computational  domain.  The  computational  domains  can  be  considered  to  be 
computationally  independent  if,  for  a  sufficiently  small  increment  in  the  propagation  variable,  all  the 
dependent  functions  describing  the  system  can  be  computed  for  all  points  in  the  computational  subdomain 


at  an  advanced  value  of  the  propagation  variable  using  only  their  value*  at  point*  in  the  computational 
domain  (computational  subdomain  and  its  border)  at  the  previous  value  of  the  propagation  variable.  The 
notion  of  computational  indepi  ndence  thus  defined  depends  on  the  numerical  method  adopted  as  well  as 
the  physical  description  and  the  mathematical  statement  of  the  problem  The  numerical  meth<>d  selected 
depends  in  turn  on  considerations  of  numerical  stability.  The  difference  in  the  numerical  stability 
properties  of  first  and  second  order  partial  differential  equations  is  the  principal  reason  for  the  difference 
in  the  partitionings  adopted  for  the  problems  discussed  in  Sections  11  and  V. 

The  analysis  of  the  computational  process  is  completed  by  means  of  a  classification  system  for  the 
independent  variables  of  the  physical  system.  The  three  classes  of  variable*  are;  i)  propagation,  ii) 
communication  and  iii)  internal  labelling.  The  first  has  already  beeu  discussed.  The  last  tv.o  are 
dependent  upon  the  mode  of  partitioning  selected  If  an  independent  variable  of  the  system  passes  from 
one  computational  subdomain  to  another  as  it  takes  on  all  its  values  it  is  said  to  be  a  communication 
variable.  If,  on  the  other  hand,  an  independent  variable  of  the  system  which  is  not  a  propagation  variable 
remains  in  a  computational  subdomain  as  it  takes  on  all  its  values,  it  is  an  internal  labelling  variable. 

The  computer  organization  for  parallel  computation  follows  directly  from  the  partitioning  and  its 
associated  scheme  of  variable  classification.  The  computational  domains  correspond  directly  to  the 
memories  of  a  set  of  independent  processors.  The  computation  proceeds  in  steps  of  one  increment  of  the 
propagation  variable  at  a  time.  After  each  computation  is  completed  data  is  transferred  between 
memories  to  fill  in  the  information  needed  on  the  borders  of  the  computational  domain  at  the  new  value 
of  the  propagation  variable.  Each  processor  is  then  ready  to  perform  another  step  of  the  computation. 
The  process  is  repeated  over  and  over  until  the  final  value  of  the  propagation  variable  has  been  attained. 

The  physical  description,  mathematical  statement  and  numerical  procedure  thus  produce  the 
computational  organization  through  the  mediation  of  the  partitioning  process  together  with  its  associated 
scheme  of  variable  classification.  The  actual  work  of  organization  is  carried  out  in  the  partitioning 
process.  The  problem  of  finding  the  optimal  computational  structure  for  a  given  formulation  of  a  problem 
is  thus  one  of  finding  the  optimal  partitioning  geometry.  The  partitioning  process,  however,  is  only  an 
intermediate  step  in  deriving  an  efficient  computational  structure.  The  search  for  an  optimal  computation 
structure  actually  begins  with  the  problem  formulation  itBelf.  Problem  formulation  can  be  considered  to 
be  comprised  of  the  three  aspects;  physical,  mathematical  and  numerical.  Seemingly  minor  reformulations 
can  have  quite  drastic  effects  on  computational  structures.  A  little  inright  in  the  initial  phases  of 
development  can  often  reap  greater  efficiencies  than  the  most  ingenious  geometrical  creations  in  the  latter 
stages.  In  Section  III  an  example  is  given  in  which  a  slight  modification  in  the  mathematical  formulation 
allows  parallel  computations  to  be  carried  out  in  special  cases. 

The  essential  feature  of  propagation  and  communication  variables  is  that  the  numerical  problem  which 
must  finally  be  solved  must  contain  very  sparse  coupling  in  these  variable*  and  be  highly  explicit.  The 
coupling  might  be  seen  in  the  mathematical  formulation  but  becomes  more  apparent  in  the  numerical 
analysis.  Some  examples  of  couplings  which  prevent  variables  from  being  communication  or  propagation 
variables  are  given  in  Seciion  111.  In  order  that  partitioning  of  a  computational  level  be  possible  it  is 
apparent  that  only  a  small  region  in  the  communication  variable  subspace  must  be  involved  in  tbe 
calculation.  The  most  immediate  requirement  of  the  numerical  system  to  be  solved  is  that  only  nearby 
values  (in  the  directions  of  tb<*  communication  variables)  be  coupled  together.  In  addition,  the  solution 
process  itself  must  not  spread  the  calc  ulation  into  other  domains  of  tbe  partitioning.  This  latter  condition 
is  quite  restrictive  and  actually  means  that  the  numerical  scheme  expresses  tbe  functions  describing  tbe 
physical  system  at  one  particular  point  in  the  computational  level  in  terms  of  'neighboring*  points  in 
earlier  computational  levels  or  at  only  a  very  specialized  class  of  points  in  the  advanced  level.  This 
requirement  which  arises  from  considering  partitioning  in  tbe  communication  variable  subspace  actually 


furnishes  the  crucial  test  of  the  propagation  variable  selection.  Explicit  numerical  schemes  are  not  always 
stable  or  may  require  extremely  small  step  sites  to  achieve  stability,  so  that  the  number  of  systems  which 
can  be  treated  in  this  way  are  limited. 

The  problems  most  readily  organised  in  this  manner  belong  to  the  class  of  problems  for  the  time 
evolution  of  physical  systems.  The  equation  which  will  be  studied  in  the  next  section  is  the  classical 
Liouville  equation.  This  equation,  besides  its  own  intrinsic  interest,  allows  the  partitioning  procedure  to  be 
analysed  in  the  simplest  possible  context.  Section  III  deals  with  the  electrostatic  Vlasov  equation.  The 
Vlasov  equation  cannot  be  treated  in  general  by  the  techniques  discussed  in  the  present  work.  Two  special 
instances  are  given  which  allow  the  mathematical  restatement  of  the  problem  for  present  purposes.  Laser 
puke  propagation  is  discussed  in  Section  IV  and  a  further  elaboration  of  the  partitioning  procedure 
introduced.  Finally,  in  Section  V  the  two  dimensional  diffusion  equation  which  exhibits  some  difficulties 
arising  from  second  partial  derivatives  is  described. 


2.  Nearest  Neighbor  Communication,  The  Simple  Example  of  the 
Classical  Liouville  Equation 

The  Liouville  equation  is  a  first  order  homogeneous  partial  differential  equation.  Physically  the  equation 
describes  the  evolution  in  time  of  a  distribution  in  phase  space.  That  is,  if  a  mechanical  system  can  be 
described  in  terms  of  3N  spatial  coordinates,  a  possible  state  of  the  system  is  represented  (classically)  by  a 
point  in  a  6N  dimensional  space  consisting  of  the  3N  spatial  coordinates  and  the  3N  associated  momenta. 
The  collection  of  all  possible  states  satisfying  some  prescribed  conditions  can  be  described  in  terms  of  a 
density  in  the  6N  dimensional  space,  which  is  called  the  phase  space  of  the  system.  The  Liouville  equation 
describes  the  time  history  of  a  distribution  in  phase  space  which  has  some  prescribed  form  at  the  initial 
time.  The  equation  thus  represents  a  first  order  initial  value  problem  which,  aside  from  its  dimensionality, 
might  represent  the  simplest  possible  mathematical  system. 

The  classical  Liouville  equation  can  be  written1 

dp/9 1  -»  E  (pjmt  dp/dqi  -  flV/Aj.  dp/ dp)  (1) 

where  i  refers  to  a  particular  degree  of  freedom  associated  with  coordinate  qi(  momentum  p;  and  mass  mj. 
The  forces  are  assumed  to  be  derivable  from  a  potential  energy  function  denoted  by  V. 

A  simple  example  is  a  harmonic  oscillator  of  one  degree  of  freedom.  This  system  is  exactly  solvable  and 
furnishes  a  convenient  test  case  for  numerical  techniques.  The  parameterless  form  of  the  equation  is 

dt/dt  +  ydt/dx  -  x  dt/dy  -  0  (2) 

with  solution 

f(xjr,t)  ■*  g(xcost  -  ysint,  xsint  +  ycost)  (3) 

where 

f(x.y.o)  -  g(x,y)  (4) 

is  the  initial  condition.  The  initial  condition,  and  the  differential  equation,  is  such  that  the  motion  is 
bounded,  that  is  f(x,y,t)  also  satisfies  the  boundary  condition 

f(x,y,t)  -*  0  if  |x|  or  |y|  -*  oo.  (5) 


A  practical  numerical  technique  for  first  order  initial  value  problems  is  approximation  by  a  truncated 
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power  series  in  time, 

f(t+fit)  -  f(t)  +  fit  f(t)  +  (fit*/2)  f(t)  (6) 

wbere  dots  denote  partial  differentiation  with  respect  to  time.  The  first  derivative  with  respect  to  time  is 
obtained  directly  from  tbe  partial  differential  equation  itself,  Eq(2),  and  tbe  second  derivative  is  obtained 
by  differentiating  this  equation  once  with  respect  to  time.  This  procedure  can  be  repeated  arbitrarily 
many  times,  but  tbe  resulting  expression  becomes  too  complex  to  be  useful.  Tbe  ntk  time  derivative  will 
involve  n(k  order  spatial  derivatives  wbicb  increases  tbe  numerical  complexity  also.  Tbe  other  partial 
derivatives  are  approximated  by  finite  difference  procedures.  If  h  is  tbe  spacing  between  grid  points,  these 
approximations  are,  to  fourth  order  in  h, 

dt/dx  =  (2/3h){f(x+h)-f(x-h)  -  l/8[f(x+2h)-f(x-2h)|}  (7) 

and 

\ 

^f/flx*  *  4/3b2(f(x+h)+f(x-h)  -  l/16[f(x+2h)+f(x-2b)| 

-  15/8  f(x)}  (8) 

as  can  easily  be  verified  by  power  series  expansion.  Tbe  cross  derivative  is  obtained  by  applying  Eq(7) 
twice. 

Stability  analysis  of  simple  numerical  schemes  for  simple  initial  value  problems  *»*  shows  that  error 
propagation  will  be  stable  in  such  schemes,  provided 

v  fit/fis  <  1  (0) 

where  v  is  some  characteristic  velocity  and  s  represents  each  of  the  independent  variables  except  time. 
The  procedure  above  suggests  itself  as  a  means  of  utilising  large  enough  displacements  in  the  non¬ 
temporal  variables  to  obtain  numerical  stability,  while  also  obtaining  numerical  accuracy.  Its  practical 
success  has  been  verified  by  numerical  computation.  An  alternative  numerical  scheme  accurate  to  second 
order  in  the  variables  At  and  Ax  is  illustrated  for  tbe  very  simple  equation 

du/dt  —  du/dx  (10) 

by 

u(x,t+At)-u(x,t)  =  (At/4Ax){u(x+Ax,t+At) 

-u(x-Ax,t+At)+u(x+Ax,t)-u(x-At)}.  (11) 

The  value  of  u  at  the  points  ( . x-2Ax,x-Ax,x,x+Ax,x+2Ax...}  at  the  new  value  of  time,  t  +  At,  now 

requires  the  solution  of  a  linear  system  of  equations.  This  is  an  example  of  an  implicit  scheme,  whereas 
the  scheme  developed  from  Eq(6)  represents  an  explicit  scheme.  The  present  results  seem  to  be  that  the 
explicit  scheme  described  above  requires  much  less  computational  labor  than  such  implicit  schemes. 

The  explicit  scheme  adopted  moves  the  function  f  forward  in  time  with  step  in  time  of  site  fit.  The 
truncated  power  series  (Eq(6))  is  used  with  time  derivatives  furnished  by  the  partial  differential  equation 
itself.  The  other  derivatives  required  are  given  finite  difference  formulas  Eqs  (7)  and  (8).  Inspection  of 
these  finite  difference  expressions  reveals  that  fewer  points  can  be  computed  at  time  t  +  fit  than  were 
available  at  time  t.  In  other  words  if,  for  instance,  the  area  of  the  computational  grid  is  a  square,  the 
points  on  the  border  and  next  to  the  border  cannot  be  computed  at  tbe  next  instance  in  time.  If  there  are 
2n  +  1  non-temporal  points  initially  in  each  direction  this  means  that  after  n  steps  in  time  tbe  procedure 
can  only  supply  the  function  at  a  single  point. 

This  difficulty  is  overcome  for  distributions  satisfying  Eq.  (5)  by  making  the  approximate  solution  zero 
at  the  boundary  points  and  at  the  points  neighboring  the  boundary.  At  first  glance  this  procedure  may 
seem  somewhat  arbitrary  as,  in  effect,  the  normal  derivative  of  the  function  as  well  as  the  function  itself  is 


set  equal  to  zero  at  the  boundary.  Numerical  studies,  however,  show  that  for  a  sufficiently  large  domain 
of  approximation  no  problems  arise.  The  higher  order  approximation  (with  the  function  set  equal  to  zero 
on  the  boundary  and  its  neighboring  points)  has  been  compared  to  an  approximation  using  lower  order 
finite  difference  expressions  for  the  non-temporal  derivatives  (which  requires  only  that  the  approximate 
solution  be  zero  on  the  boundary  itself)  and  found  to  be  much  superior.  Distribution  functions  in  both 
coordinate  and  momentum  variables  are  always  required  to  vanish  as  the  momentum  becomes  infinite. 
(Physical  problems  with  entities  moving  at  infinite  velocity  usually  have  no  significance.)  The  techniques 
described  above  should  always  be  applicable  to  the  momentum  (or  velocity)  variables.  Other  boundary 
conditions  may  be  applied  to  the  spatial  coordinates,  however.  If  the  distribution  function  is  only  strictly 
defined  in  a  finite  spatial  domain  (reflecting  walls  for  example),  clearly  the  present  methods  are 
nonapplicable.  Periodic  boundary  conditions,  for  example 

p(— <li+Lj,pj...)  =  p(...qi,Pi...)  (12) 

where  L;  is  the  period  of  the  q^  coordinate  can  be  handled  by  an  obvious  modification. 

The  process  of  generating  an  approximate  solution  to  Eq(2)  can  be  envisaged  as  follows.  A  region  of  the 
x,y  plane  is  selected  suitably  large  that  f(x j,t)  can  be  assumed  to  be  zero  on  the  boundary.  A  grid  with 
spacing  Ax  in  the  x  direction  and  Ay  in  the  y  direction  is  laid  out  on  the  region.  The  approximate 
function  is  assigned  the  value  zero  on  the  boundary  points  and  the  points  neighboring  the  boundary.  The 
approximation  retains  these  values  for  all  times.  The  initial  values  of  f(x,y,t)  are  computed  at  all  the 
other  interior  grid  points.  Using  these  points  which  represent  f(xj,t)  at  time  equal  zero,  f(x,y,t)  is 
computed  at  time  equal  ft  by  means  of  Eqs(6),  (2),  (7)  and  (8).  It  is  clear  that  many  of  these  last 
computations  are  quite  independent  of  each  other. 

This  last  step  can  be  partitioned  into  several  independent  computations,  if  the  space  of  computation  is 
.'ivided  into  subregions  with,  however,  redundant  points  added  to  the  boundaries  as  are  required  by  the 
ca.vJation.  This  is  shown  in  Fig.  1.  The  columns  of  the  subregions  in  Fig.  1  correspond  to  values  of 
f(x,y,t)  for  x  and  t  fixed.  Time  is,  of  course,  constant  through  the  region  and  x  varies  in  the  horizontal 
direction.  The  double  border  of  the  entire  region  is  shown  filled  in  with  zeros,  but  this  need  not  be  done 
explicitly.  The  leftmost  region  (it  may  be  thought  of  as  an  independent  processor  together  with  an 
associated  memory)  provides  the  means  of  advancing  columns  3  through  6,  the  center  region  columns  7 
through  12,  and  the  rightmost  region  13  through  16.  Only  one  time  step  can  be  made  since  not  all  the 
columns  in  each  region  are  updated  in  this  time  step.  Accordingly,  updated  values  of  f(x jr.t)  are  moved 
from  subregion  to  subregion  as  required  to  form  the  boundary  and  neighboring  boundary  values  in  each 
subregion  for  the  next  step.  The  columns  required  are  shown  crosshatched  in  Fig.  1  and  the  columns 
updated  which  supply  these  values  are  shown  shaded  diagonally  in  Fig.  1.  The  data  movement  is  shown 
by  the  arrows.  The  partitioned  calculation  is  seen  to  proceed  in  two  steps.  The  first  is  the  time 
propagation  of  f(x,y,t)  in  each  independent  subregion  (which  can  be  carried  out  simultaneously)  and 
second  the  transfer  of  data  between  subregions  to  prepare  for  the  next  step.  The  essential  feature  of  the 
Liouville  equation  for  the  one  dimensional  harmonic  oscillator  is  that  data  transfer  is  only  needed  between 
adjacent  subregions. 

A  slightly  more  complicated  situation  arises  from  a  periodic  one  dimensional  Liouville  equation  such  as 
dt/d t  +  y  d/dx  +  sin(x)  dt/dy  =■  0.  (13) 

Here  the  function  is  periodic  in  x.  The  partitioning  is  shown  in  Fig.  2.  The  new  feature  is  the  transfer  of 
data  from  one  end  of  the  region  to  the  other.  A  simplification  immediately  suggests  itself.  If  the  variable 
y  is  arranged  to  vary  across  the  columns,  and  x  varies  across  the  rows,  the  periodicity  will  be  contained 
inside  each  of  the  subregions  of  the  partition  and  (with  the  appropriate  modification  of  the  internal 
conditions)  the  situation  described  by  Fig.  1  is  achieved. 
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The  independent  variables  of  a  numerical  procedure  for  solution  of  a  partial  differential  equation  are 
seen  in  this  partitioning  context  to  have  three  roles;  i)  propagation,  ii)  communication,  and  iii)  internal 
labelling.  In  both  these  examples  the  time,  t,  is  solely  concerned  with  propagation.  In  the  example  of  the 
one  dimensional  harmonic  oscillator  as  discussed  above,  x  is  both  a  communication  variable  and  an 
internal  labelling  variable  (since  f(x,y,t)  is  being  computed  at  more  than  one  value  of  x  in  each  subregion). 
Finally  in  the  example  of  the  one  dimensional  harmonic  oscillator  as  discussed  above  y  is  solely  concerned 
with  internal  labelling.  As  illustrated  by  the  second  example  of  periodic  one  dimensional  motion, 
reassignment  of  the  variable  roles  can  lead  to  simpler  computational  requirements.  One  of  the  principal 
tasks  of  parallel  programming  may  be  said  to  be  that  of  assigning  the  optimum  role  to  each  variable. 

Further  subdivision  of  the  computational  domain  can  be  considered.  In  Fig.  3  partitioning  in  both  the  x 
and  y  directions  is  shown  for  the  one  dimensional  harmonic  oscillator.  The  boundary  condition  is  again 
shown  on  the  extreme  borders.  The  movement  of  data  is  again  indicated  by  arrows.  No  diagonal 
movement  is  required  if  the  data  is  moved  in  proper  sequence.  The  vertical  moves  are  made  first.  That 
is,  the  rows  containing  unshaded  plus  cross  hatched  areas  are  moved  vertically  to  the  neighboring 
subregion  and  form  the  unshaded  and  stippled  areas  of  the  border.  The  horizontal  moves  are  made  last. 
The  internal  diagonal  shaded,  cross  hatched,  and  recently  arrived  stippled  areas  are  moved  horizontally  to 
form  the  diagonally  shaded  areas  of  the  border  of  the  neighboring  subregion.  The  data  in  the  cross 
hatched  area  is  seen  to  be  moved  twice  and  to  fill  the  corner  of  the  border  in  the  diagonally  opposite 
subregion.  By  means  of  this  double  movement  the  diagonal  movement  required  to  fill  the  corners  of  the 
border  has  been  achieved. 
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The  two  dimensional  subdivision  can  be  thought  of  as  carried  to  its  ultimate  limit  if  only  four  points  in  a 
square  array  remain  in  the  computational  region.  (In  Eqs(7)  and  (8)  a  five  point  finite  difference  scheme  is 
used,  if  a  three  point  scheme  were  employed  the  ultimate  limit  of  the  computational  area  would  be  a 
single  point.)  This  is  shown  in  Fig.  4.  The  data  transfer  areas  have  completely  coalesced  and  filled  the 
computational  area  which  is  the  central  square  containing  four  points.  The  boundary  condition  is  shown 
by  the  zeros  and  the  data  moves  are  again  indicated  by  arrows.  All  the  vertical  moves  are  made  first  and 
only  a  single  square  of  four  points  is  moved.  All  the  horizontal  moves  are  made  last  and  each  time  the 
central  column  of  non-zero  data  is  moved.  In  the  first  sequence  of  moves  the  middle  squares  on  the 
horizontal  borders  of  the  subregions  are  filled  and  in  the  second  sequence  the  remaining  vertical  sides  of 
the  subregion  borders  are  filled. 


The  notion  of  an  ultimate  partitioning  is  useful  in  classifying  a  physical  problem  and  indicates  just 
exactly  how  much  parallelism  is  inherent  in  a  mathematical  structure.  The  idea  of  an  ultimate  partition 
has  more  utility  in  more  complicated  situations.  Consider  again  the  general  Liouville  equation  given  by 
Eq.(l).  In  applications  involving  large  numbers  of  molecules  the  dimensionality  of  the  partial  differential 
equation  is  so  large  that  numerical  solution  is  never  attempted.  Smaller  systems,  however,  can  be  of 
interest.  A  linear  triatomic  molecule  (with  heavy  enough  atoms  to  make  classical  mechanics 
approximately  valid)  can  be  modeled  with  four  variables  if  bending  and  coriolis  interactions  are  neglected. 
The  variables  consist  of  two  relative  spatial  variables  which  determine  the  separation  of  the  atoms  and 
the  two  associated  momenta.  If  the  two  new  variables  are  thought  of  as  internal  labelling  variables,  the 
computation  might  proceed  as  in  Fig.  I.  The  four  dimensional  problem  now  requires  each  processor  to 
perform  a  very  large  calculation.  To  reduce  each  processor's  labor  a  finer  partitioning  is  required  and  the 
situation  in  Figs.  3  or  4  is  considered.  The  ultimate  two  dimensional  partitioning  in  Fig.  4  may  perhaps 
be  over  elaborate  for  the  problem  originally  considered,  but  in  a  higher  dimensional  problem  it  may  be 
quite  viable.  Higher  dimensional  nearest  neighbor  networks  can  be  conceived  but  seem  unduly 
complicated,  particularly  for  dimension  greater  than  three.  Finally,  if  a  doubly  periodic  system  is 
considered  (two  periodic  space  variables  and  their  associated  momenta)  the  most  natural  choices  for  the 
communication  variables  are  the  two  momentum  variables,  if  a  two  dimensional  partitioning  as  in  Figs.  3 


or  4  is  employed. 


3.  Nearest  Neighbor  Communication,  the  Vlasov  Equation  in  Special 
Situations 

The  Vlasov  equation  is  an  approximate  description  of  the  behavior  of  a  system  of  free  electrons  which 
avoids  the  high  dimensionality  of  the  Liouville  equation.  The  equation  for  the  distribution  function 
f(r,v,t)  (in  the  usual  terminology  velocity  is  discussed  in  place  of  momentum)  is  4,8 


dt/dt  +  vvf  -  e/m  E(f ,t)  v  f  «  0 
•* 

(M) 

with 

E(r,t)  *=  -  V  *  M) 

•ae* 

(15) 

and 

V*  *  —  4xe{  /dv  f(r,v,t)  -  n^f)} 

(16) 

where  e  and  m  are  the  charge  and  mass  of  the  electron,  respectively  and  a^r)  is  the  charge  density  of  the 
positive  ions  which  are  assumed  to  be  so  slow  moving  with  respect  to  the  electrons  that  their  motion  may 
be  neglected.  The  analogy  to  the  Liouville  equation  can  be  made  by  identifying  V  in  Eq.  (1)  as  -ef  and 
taking  N  equal  to  one.  The  three  dimensional  spatial  gradient  operator  is  denoted  by  V  and  vv  is  the 
gradient  in  the  three  dimensional  velocity  subspace.  Another  difference  between  the  two  equations  is  that 
♦  represents  the  non  local  self  consistent  field  of  the  electrons  whereas  V  in  Eq.  (1)  is  strictly  local.  This 
feature  presents  both  the  numerical  difficulties  of  the  Vlasov  equation  and  a  nonlinear  aspect  which  gives 
the  equation  great  physical  interest. 

The  numerical  procedure  outlined  in  the  last  section  requires  an  expression  for  the  second  time 
derivative  of  f  and  thus  if  Eq.  (14)  is  differentiated  the  time  derivative  ultimately  of  #(r,t).  This  is 
obtained  by  differentiating  Eq.  (16)  and  using  Eq.  (14) 

V*#  «  .4 ire  y  Jv  f(r,v,t)dv  (17) 

where  the  term  involving  vv  has  been  integrated  assuming  f(r,v,t)  vanishes  for  large  velocity.  The  time 
propagation  procedure  can  thus  be  applied  provided  •  and  #  can  be  obtained  from  Eqs.  (16)  and  (17). 
These  Poisson  equations  cannot  in  general  be  solved  with  the  partitioning  procedures  discussed  in  the  last 
section,  and  form  a  specialised  area  of  research  in  themselves.  Similar  mathematical  systems  as 
represented  by  Eqs.  (14),  (15)  and  (16)  arise  in  fluid  dynamics* 

Situations  do  exist,  however,  which  can  be  adapted  to  the  methods  of  the  last  section.  If  f(r,v,t)  and 
E(r,t)  are  independent  of  spatial  coordinates  y  and  a  the  equations  can  be  simplified  to 

dt/di  +  v  dt/dx  +  E  dtjdv  -  0  (18) 

and 

8E/dx  —  [dv  f(x,v,t)  «  qj(x)  (19) 

where  the  components  of  velocity  vy  and  vg  have  been  integrated  out  of  the  problem  and  the  independent 
variables  as  well  as  the  densities  and  field  have  been  reparameterized  4.  Differentiating  Eq.(20)  with 
respect  to  time,  substituting  Eq.  (18)  and  using  the  fact  the  f  vanishes  for  large  velocity 

dE/dt  mm  -/°°  V  f(x,V,t)  dv  +  c(t) 


(20) 


where  c(t)  is  an  integration  constant.  If  the  ions  are  essentially  restricted  to  a  finite  region  which  contains 
tero  net  charge,  E  vanishes  at  infinity  (external  fields  are  assumed  to  be  absent)  and  c(t)  vanishes.  In 
periodic  situations  the  assumptions  are  less  clear,  but  it  seems  that  c(t)  can  be  chosen  to  be  zero,  also. 

The  working  system  of  equations  is  taken  to  be  Eqs.  (18)  and  (20)  (with  c(t)*=0).  It  is  obvious  from  Eq. 
(20)  that  since  v  is  involved  in  an  integral  over  its  entire  range,  v  cannot  be  a  communication  variable  in  a 
nearest  neighbor  communicating  partition  if  computations  are  to  proceed  simultaneously.  The  only 
partitioning  available  for  the  one  dimensional  Vlasov  equation  is  thus  of  the  kind  illustrated  by  Fig.  1. 
For  systems  periodic  in  x  there  is  no  choice  but  to  use  the  kind  of  partitioning  described  by  Fig.  2.  The 
only  new  complication  is  that  for  every  column  of  f(x,v,t),  x  and  t  fixed  and  v  ranging,  a  single  value  of 
E(x,t)  must  be  added. 

In  most  applications  so  far  the  system  actually  solved  consists  of  Eq.  (18)  and  (19).  The  reason  for  this 
choice  is  that  Eq.  (19)  involves  a  lower  moment  of  the  distribution  than  does  Eq.  (20)  and  is  thus  thought 
to  possess  more  desirable  numerical  properties.  Serial  computations  using  the  methods  of  the  previous 
section,  however,  have  been  performed  and  show  no  reason  for  preferring  one  to  the  other. 

The  ultimate  partitioning  for  the  one  dimensional  non  periodic  Vlasov  equation  is  shown  in  Fig.  5.  The 
data  transfer  areas  have  again  coalesced  completely  into  the  data  computational  area.  Each  different  area 
of  the  subdomain  contains  two  columns  of  data,  one  for  each  different  value  of  x.  Each  column  contains 
f(x,v,t)  for  all  values  of  v  and  one  value  of  E(x,t).  The  ultimate  partitioning  is  only  one  dimensional  since 
with  only  nearest  neighbor  data  transfer,  v  cannot  be  a  communication  variable. 

The  present  techniques  can  be  applied  to  another  problem  which  is  potentially  interesting.  If  a  laser 
beam  of  radial  symmetry  ionizes  a  column  of  gas  a  cylindrically  symmetric  distribution  of  ions  and 
electrons  will  result.  To  examine  this  the  Vlasov  equation  is  written  in  cylindrical  coordinates 

df/dt  +  vfdf/3r  +  {yji)dt/dt  +  yfdt/di  +  (v*/t2  +  Er)  dt /0vf 

-  (Vr/r  *  E4)dt/dy4  +  Etdt/dyg  =  0  (21) 

where  again  the  equation  has  been  reparameterized4  and  vr,v^,vt,Er,E^,EI  are  cartesian  components  of 
vectors  in  the  instantaneous  orientation  of  the  unit  vectors  in  the  cylindrical  coordinate  system.  The 
Poisson  Equation  for  the  electric  field  similarly  becomes 

l/r(3/«?r)rEr  +  l/r(<9/^)E.  +  (d/di) E,  -  /dvr  /dv,  /dvf(r,v,t) 

*  9i(r).  (22) 

Spatial  cylindrical  symmetry  implies  f  is  independent  of  ^  and  z  and  E^  and  E(  are  zero.  The  distribution 
fuuclion  cannot  be  independent  of  v^  and  v(  since  it  must  vanish  when  these  variables  are  infinite.  From 
the  form  of  Eq.  (21)  v(  cao  be  integrated  out  of  the  problem,  so  that  all  that  remains  is  f(r,vr,v^,t)  and 
Er(r). 

The  system  now  becomes 

dt/ft.  +  vr0f/<9r  +  (v,J/r*  +  Er)0f/0vf  -  (v#vf/r)  0f /^v#  =  0  (23) 

and 

(djd r)  r  Er  —  r(  /  dvf  /  dv#  f(r,vr,v^,t)  -  q,(r)}.  (24) 

The  current  equation  is  obtained  by  integrating  this  from  o  to  r,  requiring  Er  to  be  finite  at  r— 0, 
differentiating  with  respect  to  time,  inserting  Eq.(23)  and  integrating  by  parts,  so  that 


8 


8 

I." 


B 


8 


4 


aEr/at 


/  dvr  /  dv  .v  f(r,vf,v  ,,t). 
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(25) 


Finally,  Eq.  (23)  can  be  simplified  by  replacing  the  pair  of  variables  r,v^  by  the  pair  /Ju  where  (formally) 


(26) 


Using  the  chain  rule,  and  dropping  the  prime  from  the  resulting  equation  gives 


dt/dt  +  vr<9f/<9r  +  (L*/r*  +  Er)3f /3vf  -  0.  (27) 

The  'force*  which  appears  in  this  equation  is  the  'centrifugal  force*  plus  the  electrostatic  force.  The 
angular  momentum  is  L.  Since,  classically,  the  centrifugal  force  keeps  particles  away  from  the  axis,  for 
non  rero  values  of  L,  f(r,yr,v^,t)  is  required  to  be  zero  at  r  equal  to  zero. 


The  system  of  equations  to  be  solved  consists  of  Eqs.  (25)  and  (27).  Due  to  the  integral  in  Eq.  (25),  vf 
and  v^  (or  L)  can  only  be  internal  labelling  values.  Again  t  is  the  propagation  variable  and  r  is  the 
communication  variable.  A  linear  partitioning  as  in  Figs  (1)  or  (5)  is  the  only  choice.  The  columns  in 
these  figures  are  replaced  by  matrices  with  vf  and  (or  L)  varying  along  the  rows  and  columns  and  with 
each  matrix  is  associated  a  single  value  of  Er(r,t). 


It  is  interesting  at  this  point  to  compare  the  four  variable  Liouville  equation  discussed  in  the  last  section 
on  the  partitioning  given  in  Fig.  4  with  the  cylindrical  Vlasov  equation  just  discussed  on  the  partitioning 
given  in  Fig.  5.  These  configurations  represent  using  the  ultimate  partitioning  for  the  next  most  simple 
problem  in  the  new  context.  For  the  Liouville  equation  the  computational  area  of  the  simplest  problem 
consists  of  four  points.  In  the  new  context  each  of  these  points  is  replaced  by  a  matrix  whose  rows  and 
columns  can  be  thought  of  as  generated  by  the  two  new  variables.  For  the  Vlasov  equation  the 
computational  area  of  the  simpler  problem  consists  of  two  columns  which  are  generated  by  variation  in  a 
velocity  variable  (along  with  a  single  value  of  the  electric  field).  In  the  new  context  these  columns  are 
replaced  again  by  matrices  whose  elements  are  generated  by  varying  the  values  of  the  two  velocities.  In 
these  two  cases  the  matrices  are  roughly  comparable  in  6ize,  the  difference  being  that  there  are  four  in  the 
case  of  the  Liouville  equation  considered  and  two  in  the  case  of  the  cylindrical  Vlasov  equation.  The 
computational  labor  required  in  each  subdomain  is  thus  of  the  same  order  of  magnitude.  Again  the  reason 
for  these  numbers,  four  and  two,  is  because  the  high  order  finite  difference  schemes  given  by  Eq.  (7)  and 
(8)  are  being  considered.  If  a  three  point  scheme  was  employed  only  one  matrix  would  be  required. 


4.  Numerical  Organiiation  of  Laser  Propagation  Computations 

The  physical  situation  is  that  of  a  laser  pulse  of  radial  symmetry  perpendicular  to  its  direction  of 
propagation  passing  through  a  medium  containing  matter  which  the  laser  pulse  is  capable  of  exciting.  A 
simple  mathematical  model  is  given  by  the  three  equations  7,8,9 


{d/di  +  (l/c)a/at  -  ia(<5V3r*  +  (l/r  d/dr))E  -  bP, 

(28) 

(a/at  +  kp  +  io)P  -  aEW, 

(29) 

and 

(a/at  +  kwMw  -  w°)  =  7(ep*  +  e*p). 

(30) 

The  independent  variables  are  the  time  t  and  the  cylindrical  coordinates  z  and  r.  The  Pulse  is  propagated 
along  z  and  r  is  the  distance  (perpendicular  to  z)  from  the  pulse  center.  The  electric  field  due  to  the  laser 
pulse,  E,  the  polarization  of  the  medium,  P,  and  the  inversion  density,  W,  are  the  dependent  variables. 
Note  that  E  and  P  are  complex.  The  velocity  of  light  in  the  medium  is  c.  If  the  field  were  suddenly 
turned  off,  as  seen  from  Eqs.  (29)  and  (30)  P  and  W-W°  would  decay  with  decay  constants  kp  and  kw 


respectively,  and  W°  is  the  steady  state  value  of  W.  The  parameters  a,b,a,/9  sod  7  are  combinations  of 
other  physical  parameters  and  constants  of  the  system  (frequency,  t,  Planck’s  Constant  etc.)  which  have 
been  combined  together  for  simplicity.  Detailed  discussions  of  the  underlying  physical  phenomena  are 
available.  10,11  Instead  of  describing  the  rapidly  varying  electric  field  and  polarisation  themselves,  E  and  P 
actually  describe  the  assumed  slowly  varying  envelopes,  and  thus  some  second  derivatives  have  been 
neglected.  The  geometrical  nature  of  the  initial  conditions  requires  specialised  numerical  techniques. 

New  variables  r  and  q  are  defined  by 
r  ■»  t  -  i/c 

and  (31) 

n  *=  1 

From  the  chain  rule  it  is  seen  that 

a/at  =  a/ar 

and  (32) 

a/ai  +  (i/c)a/at  -  a/aq. 

The  laser  pulse  enters  the  medium  at  1  equals  tero  at  time  t  equals  zero  and  moves  with  velocity  c.  At  a 
given  point,  >,  r  is  thus  seen  to  be  the  time  after  the  'leading  edge*  of  the  pulse  has  passed.  A  fixed  value 
of  r  determines  the  path  in  the  t,s  plane  of  a  disturbance  due  to  the  laser  pulse.  The  derivative  along  the 
path  of  such  a  disturbance  is  d/dq.  The  transformation  does  not  change  the  form  of  Eqs.  (29)  and  (30),  so 
these  can  be  taken  over  unchanged,  provided  that  it  is  understood  the  time  is  measured  from  the  'leading 
edge*  of  the  disturbance.  The  remaining  equation,  Eq.  (28)  becomes 

5E/3q  -  i*{d?/dt2  +  1/r  d/dt)E  -  bP.  (33) 

The  initial  condition  for  E  is  prescribed  at  **=0,  (q«*0) 

E(o,r,t)  =  E0(r,t).  (35) 

The  medium  is  assumed  to  have  its  spatially  independent  steady  state  values  ahead  of  the  laser  pulse,  so 
that 

W(q,r,r)  —  WQ  as  r  -*  o 

and  (36) 

P(q.r.r)  —  P0  as  r  —  o. 

The  situation  is  depicted  schematically  in  Fig.  6,  neglecting  for  this  purpose  the  variable  r.  The  time 
history  of  the  laser  pulse  at  z=0  is  sketched  to  the  left  where  the  'leading  edge*  starts  at  t— 0.  The 
'leading  edge*  propagates  along  the  line  labelled  r=0  in  the  figure.  The  area  below  this  line  has  not  yet 
encountered  the  laser  pulse,  and  thus  the  initial  conditions  for  W  and  P  are  applied  on  the  skewed  line 
r=0.  The  boundary  condition  for  the  field  is  applied  at  z=0,  which  is  the  vertical  axis  to  the  left  labelled 
t=r(o).  The  boundary  of  the  space-time  region  is  indicated  by  the  diagonal  shading  attached  to  the  lines. 
The  derivative  d/dq  is  taken  along  the  skew  lines  in  the  figure  with  r  constant.  Thus  P  and  W  propagate 
in  the  vertical  direction,  starting  from  the  line  r  equals  zero,  and  E  propagates  in  the  skewed  direction 
(^constant)  starting  from  the  line  z  equals  zero. 

The  nonorthogonal  coordinate  system  can  be  simply  understood  by  considering  the  variation  in  E  along 
the  line  inconstant, 

AE,  «  (3E/AZ)  6Zr  +  (9E/dt)  frf  (37) 

where  the  subscript  r  indicates  the  special  nature  of  the  variation.  From  Eq.  (31) 

&r  —  1/c  6lr.  (38) 

Substituting  this  in  the  preceding  equation 
AE,  =  ( dE/dl  +  1/c  dE/dt)  Slt. 

In  view  of  Eq.  (32)  this  can  be  written 


(39) 


(40) 


AE, «  (dE/dij)  SZr. 

It  is  clear  from  this  expression  that 

Si  =  SZf  (41) 

as  is  obtained  from  Eq.  (31)  also.  It  might  have  been  supposed  that  the  arc  length  along  the  line 
inconstant  should  have  been  used,  but  this  is  seeu  not  to  be  the  case. 

The  numerical  solution  has  been  given  by  predictor-corrector* 1  and  semi-implicit  predictor-corrector  0 
techniques.  These  procedures  allow  a  single  step  in  E  along  ^constant,  and  a  single  step  in  P  and  W 
along  z=constant  to  be  taken.  These  two  independent  steps  must  end  at  the  same  point  if  the 
information  required  for  the  next  step  is  to  be  available.  The  nature  of  the  differential  equations  and  their 
boundary  conditions  have  thus  provided  certain  restrictions  in  the  sequence  in  which  the  solution  is 
generated.  Several  different  sequences  can  be  conceived,  but  not  all  are  suitable  for  partitioning  into 
semi-independent  computational  procedures.  A  useful  sequence  and  two  associated  partitionings  are 
shown  in  Figs.  7  and  8.  The  figures  are  shown  on  the  problem  space  shown  in  Fig.  6.  The  circles  indicate 
the  points  at  which  the  numerical  solution  is  generated  and  the  number  in  the  circle  indicates  the  6tep 
number  in  which  the  numerical  solution  is  acquired.  Each  circle  has  circles  containing  smaller  numbers 
(from  earlier  steps)  to  the  left  along  inconstant  and  below  along  inconstant.  Each  circle  containing  the 
same  number  can  be  computed  independently  of  all  the  other  circles  with  the  same  number,  and  therefore 
partitioning  can  be  performed  across  diagonals  containing  the  same  number.  Two  different  partitioning 
are  shown,  one  in  Fig.  7  to  generate  the  solution  behind  the  leading  edge  of  the  pulse,  the  other  in  Fig.  8 
to  generate  the  solution  neighboring  the  entrance  window  to  the  medium.  The  boundary  region  is  only 
shown  for  the  entire  problem  domain.  Only  the  new  values  generated  are  shown  in  the  interior 
subdomains,  their  boundary  conditions  and  data  transfers  are  omitted  for  simplicity.  In  order  to  achieve 
this  partitioning  the  computation  is  required  to  have  a  staggered  start.  In  the  first  step  computation 
takes  place  only  in  subdomain  one,  in  step  two  parallel  computations  take  place  in  subdomains  one  and 
two,  in  step  3,  computations  in  subdomains  one,  two,  and  three  and  so  on  until  all  subdomains  are  active. 

So  far  the  radial  variable  r  has  not  been  discussed  as  it  clearly  does  not  present  any  new  problems,  or 
enter  into  the  complexities  discussed  above.  The  radial  variable  clearly  is  not  a  propagation  variable.  If 
the  partitioning  indicated  in  Fig.  7  and  8  is  implemented  in  the  simple  manner  of  Figs  1  or  5,  r  is  just  an 
internal  labelling  variable  and  merely  turns  the  points  indicated  in  Figs  7  and  8  into  the  columns  of  F'igs  1 
or  5.  The  partitioning  can  be  further  refined,  however,  and  r  assigned  the  role  of  a  communication 
variable  to  produce  a  scheme  such  as  is  given  in  Figs  3  or  4. 

Finally  it  is  interesting  to  note  that  in  comparing  Figs.  7  and  8  the  roles  of  t  and  z  are  interchanged  in 
the  two  figures.  In  Fig.  7,  each  subdomain  contains  a  particular  value  of  r  which  remains  the  same  as  the 
solution  is  developed.  Each  successive  step  advances  the  subdomain  to  a  new  value  of  z  (or  q).  Thus  z  (or 
q)  may  be  said  to  be  the  propagation  variable  and  t  (or  r)  the  communication  variable.  Each  subdomain 
in  Fig.  8,  on  the  other  hand  contains  the  same  value  of  z  for  all  steps  and  is  advanced  in  steps  of  r  (or  t). 
Thus  in  Fig.  8  the  propagation  variable  is  t  (or  r)  and  the  communication  variable  is  z  (or  q). 


5.  Two  Dimensional  Diffusion 

Isotropic  diffusion  in  two  dimensions  is  described  by  the  equation 

dt/di  =  c^f/thc2  +  52f/3y2  (42) 

when  suitable  units  are  chosen  for  the  independent  variables.  The  dependent  variable  f  can  be  regarded  as 
the  concentration  of  some  substance  which  is  initially  present  in  some  non-uniform  distribution.  Only  the 
simplest  boundary  conditions,  that  for  all  time  f  has  some  fixed  value  on  a  rectangular  boundary  in  the 
x-y  plane,  will  be  considered. 
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If  an  explicit  numerical  approximation  such  as  was  employed  in  Sections  2  and  3  is  utilized,  the 
condition  for  numerical  stability  will  be 

D  At/(Ax\a  <  1  for  Ax  ~  Ay  (43) 

where  D  is  some  characteristic  numerical  constant.  This  condition  is  strikingly  different  than  the 
condition  for  first  order  systems  given  by  Eq.  (0).  Consider  the  case  in  which  a  numerical  result  has  been 
obtained  which  appears  physically  reasonable.  If  all  the  variable  spacings  are  shrunk  uniformly  the 
approximation  for  the  first  order  system  will  remain  stable  (if  initially  stable)  whereas  a  similar  kind  of 
approximation  for  a  second  order  system  may  become  quite  unstable.  In  such  an  event  a  perhaps 
unreasonably  small  value  of  At  may  be  required  for  computation.  An  unconditional  stable  explicit 
numerical  scheme  is  required  to  properly  handle  second  order  terms. 


The  variable  transformation 
S  -  (y+x)//2 

»  -  (y-*)/v/2  (44) 

is  first  carried  out.  This  leaves  the  partial  differential  equation  unchanged,  i.e. 

0f/0t  -  <?*f/<9s2  +  a^f/at*.  (45) 

The  two  dimensional  version  ,2,1*  of  a  scheme  originally  devised  by  Saul’Yev  14  is  employed  with  the  s 
and  t  variables.  An  apparently  more  accurate  version  of  the  procedure  is  available,  1&  as  well  as  another 
application  to  a  more  general  one  dimensional  problem  *•.  The  method  is  generally  called  the  alternating 
direction  explicit  procedure  (ADEP).  Two  different  forms  of  the  finite  difference  expression  are  given, 


/\t*  • 
Um,n 


m,n 


(At/2b*)($f*‘ 


m-l,n 


,  a£*< 

2  U*  +  um  , 

m,n  ra,B-l 


+  Um+t,B 


(46) 


and 


v  -V 

m,D  m,n 


(A./2  **K*ilu 


T  Tm+l,o  *  vm,n  *  ’m,n+l  / 

(47) 

where 

t  —  /At 

s  ■*  n  h  y/2 

t  =  m  h  y/2  . 

(48) 

The  finite  difference  expression  for  u  furnishes  an  approximation  for  f  at  the  gridpoint  m,n  in  terms  of 
values  of  m-l,o  and  m,n-l  for  the  same  time  level  plus  values  from  the  previous  time  level.  Thus  for  a 
given  time  level  f  is  generated  starting  from  the  lower  left  hand  corner.  The  situation  is  reversed  for 
v.  Taken  together,  the  two  equations  furnish  a  means  of  generating  an  approximation  to  f,  first  in  one 
direction,  then  in  the  opposite  direction.  Two  different  means  of  implementing  the  approximation  are  in 
use.  The  first  is  an  averaging  procedure. 


v 

m,n 


(49) 


and 


m,n 


'/2(V. 


The  second  implementation  merely  takes  alternate  directions  in  subsequent  steps, 


A  tel  At+I 

v  «  t| 
m,n  m,n 


i+X  Ait2. 
‘m.n  “  Vm,n  ' 


(50) 


The  procedure  is  adapted  for  partitioning  by  transforming  the  expression  back  to  the  original  variables  x 
and  y.  Substituting  s  and  t  from  Gq.  (48)  into  Eq.  (44)  and  solving  for  x  and  y  gives 
x  *  jh 


ja 

ii 

>» 

(51) 

where 

j  =  n-m 

and 

(52) 

k  **  n+m  . 

For  these  variables 

'it1  *  ujk  -  (At/2h2)(uftk.i  -  2  4k  +  “AV.k-l 

+  u/l.k+I  *  2  ujfk  +  uAl,k+l)  (53) 

and 

v'+>  -  v*  -  (At/2h*Kv/,,k.i  *  2  »jfk  +  »A,,k.i 

+ v/r,Ui  * 2  vi,k<+  vAu+i)  •  <54) 

These  last  expressions  allow  the  calculations  for  the  new  time  level  to  be  carried  out  in  a  vertical  rather 
than  diagonal  direction.  The  procedure  is  shown  schematically  in  Figure  9.  In  this  figure  each  grid  point 
in  the  finite  difference  scheme  is  in  the  center  of  a  small  square.  Only  half  of  the  grid  points  are  coupled 
together  by  the  finite  difference  expressions  (Eqs.  (53)  and  (54)),  so  that  f  need  only  be  found  at  half  the 
grid  points.  The  grid  points  omitted  from  the  calculation  are  located  in  the  shaded  small  squares. 
Computation  of  u  from  Eq.  (53)  proceeds  from  the  bottom  of  the  figure  to  the  top.  One  particular  step  in 
the  computation  of  u  is  shown  at  the  bottom  of  the  figure.  The  grid  points  corresponding  to  the  l+l*‘ 
time  level  in  Eq.  (53)  are  indicated  by  the  open  circles.  The  grid  points  corresponding  to  the  Ith  level  are 
indicated  by  solid  circles.  The  calculation  proceeds  in  the  direction  indicated  by  the  arrow  once  a 
horizontal  row  has  been  completed.  The  computation  of  v  as  given  by  Eq.  (54)  is  similarly  shown  at  the 
top  of  the  figure.  This  computation  proceeds  from  the  top  to  the  bottom  of  the  figure. 

The  corresponding  partitioning  and  communication  for  parallel  computation  is  shown  in  Figure  10.  The 
situation  is  quite  similar  to  that  of  Figure  1  except  that  in  the  present  case  data  is  transferred  at  the 

completion  of  the  calculation  of  each  horizontal  row  rather  than  at  the  end  of  the  calculation  of  the 

particular  time  level  of  the  entire  domain.  The  data  transferred  in  the  present  case  is  only  that  of  a  single 
grid  point  instead  of  the  transfer  of  an  entire  vertical  column  in  Figure  1.  The  domain  border  points 
containing  boundary  values  for  the  domain  computations  are  located  in  the  crosshatched  area  (except  for 
the  two  rows  shown  in  more  detail).  Two  rows  of  the  finite  difference  grid  are  shown  in  the  middle  of 
each  domain.  The  grid  points  omitted  from  the  calculation  (the  shaded  small  squares  in  Figure  9)  are 

located  in  the  stippled  squares.  The  movement  of  newly  computed  data  to  the  boundary  of  the 


neighboring  domain  (to  supply  information  needed  to  compute  the  next  row)  is  shown  by  arrows. 


The  time  variable  (index  l  in  Eqs.  (53)  and  (54))  is  a  propagation  variable  in  this  scheme.  The  index  j 
corresponds  to  communication  and  k  to  internal  labelling.  In  view  of  the  way  the  computation  proceeds 
by  horizontal  rows,  k  can  be  also  regarded  as  an  internal  propagation  variable.  The  independent  variables 
x  and  y  are  equivalent  in  Eq.  (42),  and  essentially  equivalent  in  Eq.  (2).  In  the  latter  case  a  two 
dimensional  partitioning  is  possible.  Only  a  one  dimensional  partitioning,  by  contrast,  is  possible  in  the 
present  case  due  to  the  numerical  procedure. 
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7:7  Figure  4.  Same  as  Figure  3  Except  That  Each 

Computation  Consists  Only  of  4  Points,  the 
Minimum  Number,  and  That  the  Data  Transfer 
J  Blocks  Completely  Overlap 
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Figure  7.  Partitioning  in 
Pulse  Propanation 


Figure  9.  ADEP  Finite  Difference 
Scheme  for  the  Two  Dimensional 
Diffusion  Problem.  Solid  Circles 
Are  the  Points  used  at  the  Old 
Time  Level,  Open  Circles  at  the 
New  Time  Level.  Arrows  Indicate 
the  Order  of  Computation. 
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Figure  10.  Parallel  Processor 
of  the  Two  Dimensional  Diffusio 
Method.  Data  Movement  is  Shown 
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with  common  memory,  such  as  a  CDC  CYDER  205  or  a  Cray  Research  CRAY-1,  The 
algorithms  developed  all  have  the  property  of  linear  speed-up  with  the  number 
of  processing  elements.  The  algorithms  developed  include  sorting,  merging, 
selection  among  sets,  set  interconnection,  set  difference,  subset  testing, 
and  string  matching. 
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with  Applications 
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Abstract -This  paper  propotes  a  model  or  paradigm  for  the  develop¬ 
ment  of  parallel  algorithms,  gives  an  example  of  the  proposed  paradigm, 
and  displays  algorithms  developed  by  application  of  the  technique.  The 
algorithm  for  the  merge  of  two  ordered  lists  developed  through  applica¬ 
tion  of  this  technique  is  thought  to  be  original.  The  paradigm  proposed 
is  to  create  composite  unit  operations  which  combine  data  movement 
between  data  structures  with  a  conventional  operation  such  as  compare 
os  add.  The  composite  operation  constructed  for  this  study  is  baaed 
upon  partitioning  the  data  elements  into  two  linear  lists.  Exchange  of 
data  between  adjacent  dements  in  each  list  are  then  combined  with 
compares  and  adds  to  complete  the  composite  operations.  This  com¬ 
posite  operation  can  be  implemented  on  at  least  the  following  compu¬ 
tational  architectures. 

1)  SIMD  with  all  processors  sharing  a  common  memory. 

2)  SIMD  with  local  memories  and  a  linear  interconnection  (circular 
pipeline  or  ting  network  of  processors). 

3)  Vector  processor  with  common  memory,  such  as  a  CDC  CYBER 
205  or  a  Cray  Research  CRAY-1. 

The  algorithms  developed  all  have  the  property  of  linear  speed-up 
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with  tiie  number  of  processing  elements.  The  algorithms  devdoped 
include  sorting,  merging,  selection  among  sets,  set  interconnection,  set 
difference,  subset  testing,  and  string  matching. 

Index  Terms  -Data  structures,  parallel  algorithms,  set  processing  algo¬ 
rithms,  sorting  and  merging. 

I.  Introduction 

HERE  is  substantial  theoretical  and  practical  interest  in 
the  development  of  parallel  algorithms  for  the  common 
processes  of  numeric  computing  and  data  processing  [3J-[5] , 
[9] .  There  is,  accordingly,  a  need  for  models  and  paradigms 
for  the  development  of  parallel  algorithms  akin  to  the  tradi¬ 
tional  general  techniques  which  have  been  used  for  sequential 
algorithms  [10] . 

Lint  and  Agarwala  [II]  survey  and  analyte  the  models  used 
in  formation  of  parallel  algorithms.  All  of  the  models  for 
synchronous  parallel  algorithms  are  based  on  direct  abstraction 
of  the  physical  structure  of  the  system  such  as  common  shared 
memory  or  specification  of  an  interconnection  network:  ‘‘sort¬ 
ing  on  a  mesh-connected  computer”  or  ‘‘Bitonic  Sort  on  Dliac 
IV,”  etc.  This  focus  on  a  specific  architecture  does  not  generate 
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insight  into  the  logical  structure  of  the  algorithm  itself  and 
makes  transfer  to  other  architectures  difficult. 

Models  for  asynchronous  parallelism  are  typically  very  ab¬ 
stract:  directed  graphs,  Petri  nets.etc.  This  level  of  abstraction, 
while  extremely  powerful  for  structuring  process  interactions, 
often  hides  the  actual  computations  of  an  algorithm  and  makes 
meaningful  analysis  of  execution  times  difficult. 

There  is  a  need  for  models  of  parallel  processing  which  are 
abstract  enough  to  be  generalizable  to  several  architectures, 
but  which  are  sufficiently  descriptive  to  allow  ready  determi¬ 
nation  of  execution  costs.  Sequential  algorithms  are  typically 
modeled  as  operations  on  data  structures.  This  model  is  in¬ 
adequate  for  parallel  processing  because  it  does  not  consider 
interprocess  communication. 

Guidance  for  the  development  of  a  paradigm  may  be  obtained 
from  consideration  of  the  characteristics  of  models  of  parallel 
processing.  Models  of  parallel  processing  must  include  costs  of 
synchronization  and  interprocess  communication  as  well  as 
operation  counts  for  the  computational  steps  of  the  algorithm. 
The  paradigm  proposed  here  is  to  create  composite  unit  opera¬ 
tions  which  combine  data  movement  between  data  structures 
with  conventional  logical  or  arithmetic  operations  such  as 
compare  or  add.  The  composite  operation  is  applied  to  data 
values  specified  by  their  positions  in  a  data  structure.  This 
paper  defines  and  applies  one  example  of  this  paradigm  for  the 
development  of  parallel  algorithms.  The  composite  operation 
is  based  upon  partitioning  the  data  elements  into  two  linear 
lists.  The  composite  operations  are  defined  by  combining  add 
and  compare  operations  with  a  data  movement  operation  of 
value  exchange  between  adjacent  elements  of  each  list  and 
identically  indexed  elements  in  the  separate  lists.  The  subse¬ 
quent  section  defines  and  characterizes  the  algorithm  and  its 
cost  of  execution.  The  analysis  of  execution  cost  is  greatly 
simplified  by  the  use  of  the  composite  operation  since  data 
movement  and  synchronization  costs  are  included  in  the 
operations  count  analysis.  Algorithms  derived  with  the  para¬ 
digm  are  given  in  Section  HI.  The  characteristics  and  architec¬ 
tural  requirements  of  the  algorithm  are  given  in  Section'  IV. 
The  paradigm  will  be  contrasted  to  the  more  usual  approach  of 
algorithm  formulation  for  specific  parallel  architectures.  Defi¬ 
nition  of  data  movement  as  occurring  between  data  structures 
allows  for  mapping  to  a  spectrum  of  parallel  architectures  or 
interconnection  structures. 

II.  Description  of  the  Algorithm 

The  data  structures  used  are  linear  lists.  Linear  lists  A  and  B 
contain  the  data  elements  upon  which  the  algorithm  will  be 
executed  and  between  which  data  movement  will  be  defined. 
It  will  be  necessary  to  occasionally  employ  auxiliary  linear  lists 
which  we  will  note  by  C,  FLAG,  etc.  These  auxiliary  variables 
will  be  used  to  hold  intermediate  results  for  some  algorithms. 

We  first  write  the  algorithm  in  the  simple  case  where  the 
number  of  elements  in  each  data  array  is  n  and  the  number  of 
processors  is  n.  We  then  give  the  algorithm  in  general  form 
where  A  and  B  may  be  of  different  dimension  and  where  the 
number  of  processors  k  may  be  less  than  the  dimensions  of 
A  or  B.  Proofs  of  linear  speed-up  and  bounds  for  computation 
time  are  then  given  for  this  general  case. 


A  conventional  notation  is  used.  The  computational  steps 
within  cobegin-coend  brackets  are  processed  in  parallel  with 
the  degree  of  parallelism  specified  by  the  range  of  the  index 
of  the  cobegin.  For  example,  the  notation 

cobegin:  /  <0:n-l> 
operation  (A  [i]  ,/?[»]) 

coend 

implies  n  parallel  executions  of  operation  ( A  [/] ,  B[i) )  on  each 
of  the  elements  <0  <  i  <  n- 1>  of  A  and  B.  Processors  are 
assigned  a  number  from  0  to  n-1  (or  from  0  to  it- 1)  and  the 
processor  of  index  i  acts  upon  the  data  elements  of  index  i. 

Algorithm  (for  n  processors  and  n  elements  in  each  linear 
lists): 

begin 

for  /  *  0  to  if- 1  do 
cobegin:/ <0:n- 1> 

operation  (A  [f]  ,B\j],  C\j J ) 
coend 

end. 

Each  execution  of  the  cobegin-coend  bracket  executes  simul¬ 
taneously  the  as  yet  undefined  composite  arithmetic/logical 
plus  data  movement  operation  on  the  three  tuple  A\j),B\j], 
C\j ] .  This  is  then  repeated  n  times  to  complete  the  execution 
of  the  algorithm. 

In  order  to  conveniently  display  that  linear  speed-up  in  the 
number  of  processors  is  obtained,  we  write  the  algorithm 
where  A  and  B  are  of  size  n  and  m,  respectively,  and  k  pro¬ 
cessors  are  used.  We  restrict  the  number  of  processors  k  to 
the  situation  where  m=ck  and  c  is  integer. 

Algorithm  (k  processors,  m  -  ck,m<n ): 

begin 

for  i  -  0  to  n- 1  do 

for/  *  0  to  m- 1  in  steps  of  k  do 
cobegin:  l<0:k-l> 
operation  (A  [/] ,  B\l\ ,  C[l\ ) 
coend 

end. 

All  algorithms  expressible  in  this  form  show  speed-up  propor¬ 
tional  to  the  number  of  processors  if  we  neglect  load  time  and 
output  time,  as  is  commonly  the  case  at  this  level  of  resolution 
of  detail. 

Lemma  1:  Using  k  processors  the  algorithm  uses  Q(mn/k) 
steps. 

Proof:  Let  m=ck  for  some  integer  c.  The  inner  loop  takes 
0(m/k)  steps.  Therefore  the  outer  loop  takes  0(mn/k)  steps. 

Definition:  Let  \A  I  and  \B\  denote  the  sizes  of  the  arrays  A 
and  B,  respectively.  Let  min(|A|,  |B|)and  max(Ml.  |fl|)  de¬ 
note  the  minimum  and  maximum  of  the  sizes  of  the  arrays 
A  and  B,  respectively. 

Theorem  1:  Using  min(lA|,  |fi|)  processors  the  algorithm 
uses  max(H  I,  l#l)  steps. 

Proof:  m=\B\,  n=\A  |  and  m<n.  Therefore,  m=min(|>4 1, 
|£|)  and  n=max(|A|,  |ff|).  Substituting  k-m  in  Lemma  1 ,  the 
theorem  follows. 
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III.  Application 

The  utility  of  the  technique  is  illustrated  by  developing  algo¬ 
rithms  for  sorting,  merging,  selection,  set -problems,  and  string¬ 
matching.  In  all  the  algorithms  we  use  min(|A|,  |fl|)  processors 
in  order  to  simplify  the  presentation.  Extension  to  the  case 
where  the  number  of  processors,/!:,  is  less  than  min(|A|,  Iff |)  is 
straightforward  so  long  as  m  =  ck  and  c  is  integer. 

A.  Sorting 

Sorting  of  a  sequence  of  elements  can  be  performed  by  rank¬ 
ing  each  element  in  the  sequence.  Rank  of  an  element  x  is 
defined  to  be  the  number  of  elements  greater  (less)  than  jc. 
Ranking  is  done  by  comparing  every  element  against  every 
other  element  in  the  sequence.  Let  A  and  B  be  two  arrays 
each  of  size  n,  each  containing  the  list  of  elements  to  be  sorted. 
Initially  C\i J  =  0  for  i  =0, 1 , 2  •  •  • . 

Sorting  Algorithm: 

begin 

for/'=0  ton-1  do 
cobegin: /<0:n-l> 

•f  A  [/]  >5 1/)  then 

C[f]  ■- C(/J  +1;  ►  =  operation 

A  [j]  =  A  [0  +  0  mod  »]  (A  [/]  ,B\j),  C\j\ ) 

coend 

end. 

The  is  a  data  transfer  between  the  two  list  positions.  The 
composite  operation  here  consists  of  comparison  between  two 
identically  indexed  elements  of  A  and  B,  incrementing  a  rank 
counter  C  and  circular  left-shift  of  the  elements  of  A.  This 
sorting  algorithm  correctly  ranks  all  the  elements  in  the  array 

B.  The  rank  of  identical  elements  will,  however,  be  the  same. 
Consequently,  completion  of  the  sort  by  moving  the  data 
elements  to  their  positions  by  rank  will  result  in  all  identical 
elements  being  moved  to  the  same  locations  in  array  A.  In 
order  to  avoid  this,  we  adopt  the  following  solution.  Identical 
elements  are  placed  in  consecutive  locations  in  the  final  array 
on  the  basis  of  their  relative  positions  in  the  array  B.  If  Bi  and 
Bj  are  two  identical  elements,  B,  follows  Bj  in  the  sorted  se¬ 
quence  if  i<j  and  vice  versa  if  Oj.  An  element  in  the  /th 
position  in  the  array  B  is  compared  against  elements  to  its  left 
after  n-  j  iterations  of  the  for  loop.  Processor  Pj  which  always 
compares  the  /'  element  in  array  B  with  every  other  element, 
updates  its  count  on  comparing  with  an  equal  element  only 
after  n-j  iterations. 

Sorting  Algorithm  (identical  list  elements): 

begin 

fori*0ton-l  do 
cobegin:  /  <0:n-l> 
if  (A  {/J  >B[f\) 
or  ((A  [/I  *  B\j] ) 
and  /  >  (n  -  /)) 
then 

C[i)  :-C|/|  +1; 

A 1/1  =  A((/  +  i)modn) 
coend 


B.  Selection 

Given  a  sequence  of  elements  ax ,  at ,  •  •  • ,  an ,  selection  in¬ 
volves  choosing  the  Arth  largest  (smallest)  element  in  the 
sequence.  By  ranking  the  elements,  selection  can  be  performed 
by  indexing.  The  technique  used  to  rank  elements  in  the 
sorting  algorithm  is  used  for  selection. 

C.  Merging 

Array  A  is  of  size  n  and  B  is  of  size  m.  Initially  A\  1]  < 
A[2)  <  -A[n]  and £[] ]  <B[2]  •  <B[m). 

Merging  A  Igorithm: 

begin 

for  i=0  to  n- 1  do 
cobegin:  i<0:m-l> 
if  A 1/1  >B[j)  then 
A\j\  **B\j]  then 
A[j]  =  A  [(>  +  /)  mod  n) 

coend 

end. 

The  “•**”  operator  interchanges  the  elements  of  two  lists. 

On  termination,  A  [1  ]  <  A  [  2]  <  •  •  •  <  A  [n]  <B[1]  <  • 

<  B[m] .  This  algorithm  has  not  been  previously  found  in  the 
literature  and  is  believed  to  be  a  new  algorithm  for  parallel 
merger  of  two  lists.  Proof  of  this  algorithm  is  given  in  [7] . 

D.  Set  Intersection 

The  two  sets  of  elements  are  contained  in  arrays  A  and  B. 
Array  A  is  of  size  n  and  array  B  of  size  m  (n> m).  Let  FLAG 
be  a  Boolean  array  of  size  m.  Initially,  all  the  values  in  the 
Boolean  array  are  false. 

Set-Intersection  Algorithm: 

begin 

for i  =0  ton- 1  do 
cobegin:  /  <0:m-l> 
if  A  [/)  ~  B\j]  then 
FLAG[/'J  *-  true; 

Aj/'+l]  «-A(/']  modn 
A  j/]  =  A  |(/'  +  i)  modn] 
coend 

end. 

On  termination  the  true  values  in  the  FLAG  array  correspond 
to  elements  in  the  array  B  that  are  also  contained  in  array  A . 

E  Set  Difference 

Arrays  A  and  B  represent  the  elements  in  the  two  sets  as  in 
Section  II1-D.  A  FLAG  and  BFLAG  are  two  Boolean  arrays. 
AFLAC  is  of  size  n  and  BFLAG  is  of  size  m  (n>m).  Initially, 
the  values  in  both  the  boolean  arrays  are  false. 

Set-Difference  A  Igorithm: 

begin 

for  i'=0  ton-1  do 
cobegin:  ;<0:m- 1> 
if  A  [/']  =*[/]  then 


end. 
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begin  A  FLAG  [/]  «-  true; 

5FLAG[/]  *•  true; 

A[j]  =  A  (0+0  mod  n] 
end 

coend 

end. 

On  termination,  the  false  values  in  A  FLAG  correspond  to 
elements  in  array  A  that  are  not  in  array  B,  and  the  false  values 
in  RFLAG  correspond  to  elements  in  array  B  that  are  not  in 
array  A. 

F.  Subset  Testing 

Arrays  A  and  B  denote  sets  of  elements  as  in  Section  IV-D 
and  III-D.  FLAG  is  a  boolean  array  of  size  m.  Initially,  all 
the  values  in  Care  false. 

Subset  Testing  Algorithm: 

begin 

for/=0  to/t-1  do 
cobegin:  /<0:m-l> 
if  A  If]  ~B[j]  then 
FLAG[/]  -*-true; 

A  [/I  [(/  +  «)  mod  n] 

coend 

end. 

On  termination,  if  all  the  values  in  FLAG  are  true,  then  B 
is  contained  in  A . 

G.  String-Matching 

List  A  holds  the  text  string  and  List  B  holds  the  pattern 
string.  String-matching  involves  determining  whether  the 
pattern  held  in  B  occurs  as  a  substring  of  the  text  held  in  A . 

The  subset-testing  algorithms  of  Section  III-F  can  be  used 
for  string-matching  with  a  slight  modification.  Instead  of  test¬ 
ing  the  FLAG  array  at  the  termination  of  the  algorithm,  the 
array  is  tested  after  every  iteration.  If  after  any  iteration  the 
value  of  all  elements  in  the  FLAG  array  are  true,  then  the 
pattern  held  in  B  occurs  as  a  substring  in  A . 

IV.  Analysis  of  the  Proposed  Paradigm 

The  usual  procedure  for  establishment  of  parallel  algorithms 
is  to  map  a  sequential  algorithm  to  some  specific  parallel  archi¬ 
tecture.  The  execution  cost  is  determined  by  separately  count¬ 
ing  operations  and  data  movement  costs.  The  technique  of 
defining  a  composite  operation  against  a  data  structure  both 
generalizes  this  approach  and  simplifies  execution  cost  analysis. 

The  composite  operation  in  these  examples  uses  only  the 
adjacent  elements  in  each  list  and  identically  indexed  elements 
between  the  two  lists.  This  composite  operation  can  be  realized 
in  many  commonly  considered  models  of  parallel  architectures, 
including  S1MD  with  common  memory,  a  circular  pipeline  of 
SIMD  processors  with  no  common  memory,  and  a  vector 
instruction  architecture.  Realization  on  SIMD  with  common 
memory  and  with  vector  instructions  is  obvious.  Realization 
on  a  circular  pipeline  can  be  realized  as  shown  in  the  schematic 
of  Fig.  1 . 


Processor  n-1  Processor  n-2  Processor  I 


Fig.  1.  Circular  pipeline  of  processors  architecture. 

Ra  and  Rb  are  registers  to  hold  the  values  of  lists  A  and  B. 
The  registers  of  each  processor  can  compare  and  exchange 
values.  The  processors  are  connected  by  one-way  pipes.  A 
“shift”  instruction  moves  the  values  from  RA  of  processor  / 
to  Ra  of  processor  /'+!  with  end-around  for  n-1  to  0.  This 
basic  structure  can  be  augmented  with  additional  registers  for 
each  processor  as  needed  for  FLAG,  etc.,  vectors.  This  simple 
and  regular  architecture  suggests  VLSI  implementation.  It 
should  be  noted  that  this  architecture  will  not  implement  the 
string  matching  algorithm  without  special  provision  for  the 
AND  of  the  Boolean  values  in  the  FLAG  vector  at  each  cycle. 

A  wide  spectrum  of  algorithms  can  be  constructed  by  com¬ 
posing  only  compare  and  add  with  the  very  simple  data  move¬ 
ments  among  so  simple  a  data  structure  as  a  linear  list.  Execu¬ 
tion  costs,  including  data  movement,  are  readily  determined 
for  a  number  of  different  architectures  for  each  algorithm. 
The  operations  and  algorithms  also  suggest  effective  architec¬ 
tures  for  given  problems.  Definition  of  composite  operations 
on  square  arrays  leads  to  formulation  of  graph  and  image 
processing  algorithms.  These  will  be  explained  in  a  later  paper. 

V.  Summary 

We  have  proposed  and  described  a  paradigm  for  the  design 
of  parallel  algorithms.  We  have  applied  the  technique  to  pro¬ 
duce  a  number  of  algorithms.  The  algorithms  produced  by 
the  application  of  the  paradigm  have  desirable  properties  such 
as  linear  speed-up  and  readily  obtainable  bounds.  The  tech¬ 
nique  produces  algorithms  which  are  applicable  to  a  wide 
variety  of  computational  architectures.  We  propose  that  the 
paradigm  of  specifying  composite  operations  against  data 
structures  is  a  basis  for  at  least  one  direction  of  study  of 
parallel  algorithms. 
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Abstract 

The  communication  costs  for  parallel  versions  of  two  aimpte 
algorithms  used  in  image  processing  are  compared  in  packet 
twitching  and  circuit  switching  formulations.  The  two  algorithms 
are  smoothing  and  bistogramming.  The  histogramming  algorithm, 
the  recursive  doubling  algorithm  of  Stone,  is  studied  over  a  range 
of  processor  numbers  and  pixel  intensity  resolution.  The  packet 
and  circuit  switching  properties  of  the  interconnection  networks  of 
the  multiprocessor  systems  are  based  on  two  network  architcetured 
multiprocessors  which  are  well-documented  in  the  literature.  PASM 
and  TRAC.  Communication  based  upon  circuit  switching  generally 
gives  a  somewhat  lower  communication  cost  with  the  advantage 
increasing  with  pixel  intensity  resolution.  The  results  of  the 
analysis  suggest  a  high  utility  value  for  including  both  circuit 
switching  and  packet  switching  functionality  in  the  networks  of 
network  architectured  multiprocessor  systems. 

Introduction  and  Overview 

This  paper  compares  the  communication  costs  for  executing  two 
algorithms  used  in  image  processing  on  three  parallel  computer 
architectures.  The  purpose  cf  the  comparison  is  to  evaluate  packet 
twitching  and  circuit  switching  modes  ot  data  movement  tot 
interconnection  network  based  multiprocessors.  The  two 
algorithms  used  for  the  comparison  are  computation  of  histograms 
of  the  intensity  values  of  pixels  of  an  image  and  smoothing  of  gray 
level  data  across  the  pixels  of  an  image. 

The  model  for  a  packet  switching  architecture  is  the  Partitionable 
S1MD/M1MD  (PASM)  System  for  Image  Processing  and  Pattern 
Recognition  |Siegcl8l|.  The  model  for  a  circuit  switching 
architecture  is  the  Texas  Reconfigurable  Array  Computer  (TRAC) 
|Sejnowski80],  The  third  architecture,  all  processors  sharing  a 
common  bus  |Bhuyan82j,  is  given  as  a  baseline  for  the  comparison. 
An  nnalysis  of  communication  costs  for  the  two  algorithms 
executing  on  PASM  has  been  given  in  [Sicgcl8lj.  The  resuits  of  an 
analysis  of  the  execution  of  the  two  algorithms  in  a  circuit 
(witching  formulation  based  on  TRAC  aie  given  here.  Space 
limitations  preclude  detailing  of  the  analysis. 

Communication  Analysis  for  Parallel  Algorithms 

The  major  factors  determining  communication  cost  for  the 
execution  of  parallel  algorithms  on  interconnection  network  (ICN) 
based  multiprocessors  include:  (i)  the  topology  of  the  ICN  and  the 
configuration  of  resources  on  the  ICN.  (ti)  the  mapping  of  the  data 
movement  requirements  of  the  algorithm  upon  the  ICN,  (iii)  choice 
ot  switching  methodology,  (iv)  the  latency  and  bandwidth 
properties  ol  the  ICN,  and  (v)  the  unit  sires  and  the  total  volume 
of  the  data  to  be  moved.  This  paper  focuses  on  the  impact  of 
twitching  methodology  nod  data  volume  on  communication  cost. 

The  choke  of  packet  switching  or  circuit  switching  as  the  mode  of 
■etwork  data  path  establishment  can  have  a  substantial  effect  on 
each  ot  these  architectural  parameters.  Packet  switching  tends  to 
give  flexibility  ia  topology  but  fixed  unit  transfer  sires.  Circuit 
•witching  tends  toward  less  flexibility  ia  tope  logy,  greater  flexibility 
ia  aait  site  for  transfers,  but  a  longer  Itansfrr  latency  time.  Packet 
•witching  may  also  introduce  bandwidth  degradation  due  to  path 
confection  while  circuit  switching  may  introduce  path  blockages 
whkh  limit  realisable  network  topologies  (or  all  networks  short  of 
full  cross-bar*. 


The  measure  of  communication  cost  is  elapsed  time.  The 
communication  times  given  herein  are  reported  as  number  of 
memory  cycles.  We  assume,  in  order  to  normalise  computation 
costs  across  architectures,  that  an  integer  addition  takes  a  single 
memory  cycle  and  that  updating  a  histogram  vector  <  Ici.icnt 
requires  two  integer  additions.  The  speed-up  of  a  multiprocessor 
over  a  uniprocessor  is  the  ratio  of  total  execution  times,  TE,  where 
TE  «•  TpQj^  +  TCOVJp.  All  LOG's  in  this  psper  are  in  base  2 
unless  otherwise  noted.  The  data  paths  of  each  ICN  are  taken  to 
be  one  integer  word  in  width.  For  the  multistage  ICN’s  of  PASM 
and  TRAC  it  is  assumed  that  a  unit  of  data  moves  through  one 
stage  of  the  ICN'  on  each  memory  cycle. 

Definition  of  Architectures 

Communication  cost  for  execution  of  the  two  algorithms  is 
compared  for  three  ICN-based  multiprocessor  architectures.  The 
single  shared  bus  architecture  (Figure  1)  has  been  characterized  by 
Bhuyan  and  Agrawal  |Bhuyan82j.  It  is  a  baseline  for  ICN-bnscd 
multiprocessors.  There  is  no  distinction  between  packet  and  circuit 
switching  in  this  model  of  communication.  The  model  for  a  packet 
switching  data  movement  architecture  is  PASM  [Siegeigrj.  The 
ICN  cf  PASM  connects  complete  processing  elements  as  shown  in 
Figure  2. 
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The  interconnection  artworks  proposed  for  PASM  are  the 
generalized  cube  and  the  augmented  data  manipulator  (ADM) 
|kiegel7#|.-  These  two  networks  are  optimal  for  histogramming  in 
the  tense  that  all  permutations  for  the  algorithm  can  be  realized  by 
both  networks  in  a  single  pass.  Thus  packet  transfers  can  take 
place  without  blocking. 

The  model  for  a  circuit  switching  data  movement  architecture  is 
TRAC  (SejnowskiM))  TRAC  places  processors  at  the  apex  nodes 
and  memories  at  the  base  nodes  of  its  ICN  (Figure  3).  The  1CN  of 
TRAC  is  an  SW-Baryan  |Prrmkuinar80)  with  nodes  having  spread 
of  two  and  fanout  of  three  for  its  IC'N.  Processor  configurations 
are  formed  by  establishing  circuits  in  the  ICN  joining  processors  to 
memory  units.  Data  flow  between  processors  for  different  stages  of 
the  algorithms  can  be  realized  by  dynamically  switching  memories 
between  processor-memory  configurations.  This  network  also 
implements  trees  of  circuits  joining  one  memory  to  many 
processors  in  which  any  one  circuit  may  be  activated  and/or 
deactivated  by  a  tingle  processor  instruction.  These  tree  circuits 
are  called  shied  or  switchable  memory  trees.  Data  flow  between 
processors  may  be  implemented  using  this  capability  by  a  sequence 
of  circuit  activations  and  deactWations  (among  the  circuits 
following  the  tree). 

The  ICN  of  TRAC  actually  implements  both  circuit  switching 
and  packet  switching  but  only  tbe  circuit  switching  mode  of  use  a 
modeled  in  the  equations  given  following. 

The  Algorithms  and  Their  Mapping  to  the  Architecture 

Histogramming  and  smoothing  are  among  the  basic  operations  of 
image  processing,  although  not  usually  rate  determining  steps  in 
the  computations.  Attention  to  detailed  parallel  formulations  of 
major  computational  steps  of  image  processing  such  as  thresholding 
and  edge  detection  is  needed.  It  is  assumed  in  the  analysis 
following  that  the  picture  is  M*M  pixels  in  size  (M=2m)  snd  that 
N  (N— 2°)  processors  are  available.  The  resolution  of  each  pixel  is 
X  bits. 

Histogranimiag  Algorithm 

The  parallel  algorithm  for  histogramming  is  the  recursive 
doubling  algorithm  of  Stone  [Stone75|.  The  structure  of  the 
nlgorithm  is  shown  in  Figure  4  for  N=8.  N  partial  histograms  are 
computed  in  parallel  at  level  0.  Each  partial  histogram  is  n  vector 
of  length  2k.  The  partial  histogram:  are  then  added  in  pairs  in 
parallel  for  LOG(N)  stages  to  complete  the  algorithm. 


£2  level  I 


Figure  4:  Recursive  Doubling  Algorithm 
for  Histogramming 

Partial  histograms  are  shown  at  level  0  by  A't  and  vector  additions 
by  B'a.  N/2’  transfers  of  vectors  ate  done  between  level  (i-1)  and 
level  I.  The  computation  time,  T^^jp,  for  this  algorithm  under 

(be  asaumptions  made  here  is  proportional  to  TCOMp  —  M*/N  + 
2*  LOG(N). 


A  Packet  Switching  Formulation  of  Recursive  Doubling 
Histogramming  •  Siegel  cf  a  I  jbicgrISiP hive  given  a  thorough 
analysis  lor  the  execution  of  this  algorithm  oo  PASM.  We  adopt 
the  results  of  this  study  as  our  packet  switching  model  of  recursive 
doubling  histogramming.  It  is  commonly  the  ease  that  further 
steps  in  the  analysts  of  the  image  require  thresholding  so  that  tbe 
final  histogram  vector  must  be  collected  in  one  processor  and  the 
threshold  value  dislnbuted.  The  total  communication  time  for  this 
formulation  of  the  nlgorithm  in 


Recursive  Doubling 


Tcomw  “  l(LOG(N’)  +  2X)  +  2|  *  LOG(N). 

travel  time  switch  number  of  levels 
through  the  setting  in  the  ICN 
ICN  time 

A  Circuit  Switch  Formulation  of  Recursive  Doubling 
Histogramming  Based  on  Tree  Circuits  -  Figure  5  illustrates  the 
structure  of  tbe  circuit  swiTcEed  data  movement  formulation  of 
recursive  doubling  for  an  8  processor-8  memory  configuration. 
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Figure  5:  Circuit  Switching  Using  the 
Tree  Circuit  Formulation 


The  M*  pixels  are  evenly  partitioned  among  the  8  memories.  Each 
processor  computes  n  partial  histogram  vector  and  stores  it  in  the 
corresponding  memory.  The  computation  is  then  completed  in 
LOG(8)=3  stag**  of  adding  partial  vectors  with  the  full  histogram 
computed  by  processor  3  and  stored  in  memory  5.  The  tree 
circuits  of  Figure  5  implement  tbe  successive  communication  paths 

between  levels  in  Figure  5.  The  ■ - *  tree  circuits  implement  the 

data  flow  between  levels  0  and  1  in  Figure  4,  *ooooo*  the  data  flow 

between  levels  1  and  2  and  • *  the  data  flow  between  levels  2 

and  3.  There  is  n  regular  pattern  of  using  first  the  verticals  and 
then  the  diagonals  of  each  type  of  tree  circuit.  Each  tree  circuit 
type  has  a  unique  number  (called  COLOR  in  correspondence  with 
graph  theory).  LOG(N)  colors  are  required  to  implement  the 
nlgorithm  in  this  formulation.  Path  selection  (activation  and 
deactivation)  in  all  tree  circuits  of  identical  COLOR  can  be  done  in 
parallel  with  a  latency  lime  proportional  to  LOG(N)/2.  The  ICN 
of  TRAC  can  implement  the  tree  circuit  pattern  of  Figure  5 
without  blockage.  Tbe  total  communication  cost  for  this 
formulation  of  recursive  doubling  bisla*  amming  is 

LOC(N) 

Toomm  “  |LOG(N)]->  r  N/21  [LOG(N)  +  LOG(N)) 

•w  (3/2)(N-l)  time  to  latency 

switch  nil  time 
memory  with 
identical  COLOR 


A  Circuit  Switching  Formulation  of  Recursive  Doubling  Based  on 
Hirer t  Hrronfiguraiion  -  Another  formulation  based  on  circuit 
switcliiug  cao  be  developed  by  dirrrtly  reconfiguring  the  ICN  after 
each  step  (level  in  Figure  4)  of  tbe  algorithm  to  conform  to  the 
data  movement  path  required  at  each  stage  of  the  algorithm.  Each 
configuration  step  involving  establishment  of  n  circuit  between  a 
given  processor  and  a  set  of  memories  must  be  done  serially.  Thus 
ase  of  the  tree-circuit  based  algorithm  is  faster  by  a  factor  of 
LOG(K)  where  K  is  the  number  of  COLORS  available. 

The  Smoothing  Algorithm 

Smoothing  is  replacement  of  the  intensity  of  each  pixel  by  the 
mean  of  the  intensity  of  the  given  pixel  and  its  nearest  neighbors. 

Packet  Switching  Formulation  of  Smoothing  -  Siegel  rt  of 

1SicgH8l|  have  formulated  ami  analyzed  a  packet  data  movement 
ormulation  of  the  smoothing  algorithm.  They  show  n  speed-up  of 


about  .8N  for  a  1021  processor  configuration  This  estimate  ia 
extremely  conservatively  based.  A  greater  speed-up  is  probable. 

Circuit  Switching  Formulations  of  Smoothing  -  A  circuit 
switching  structure  for  the  smoothing  operation  is  suggested  by  the 
fact  that  each  computation  requires  only  nearest  neighbors. 
Therefore  if  the  pixels  are  stored  by  columns  then  a  processor  will 
need  simultaneous  access  to  three  columns  (say  k-l,k,k+l)  to 
execute  the  computations  on  column  k.  A  realization  of  this 
representation  of  the  smoothing  algorithm  is  given  in  Figure  6. 
Extra  zero  valued  rows  and  columns  of  pixel  values  are  added  to 
each  formulation  of  boundary  conditions.  The  solid  lines  of  Figure 
0  are  normal  circuits.  The  dotted  lines  are  tree  circuits  from  which 
leaf-root  paths  can  be  selected.  Processor  1  computes  in  sequence 
the  smoothed  values  for  the  pixels  in  columns  1,  2  and  3.  Processor 
2  will  simultaneously  and  in  sequence  compute  the  smoothed  values 
for  the  pixels  in  columns  1,  S  and  6.  PI  and  P2  must  share  access 
to  pixel  columns  3  and  4.  The  execution  procedure  described 
preceding  allows  this  sharing  to  be  accomplished  without  conflict  it 
the  required  circuits  can  be  established  in  the  network.  This  two 
processor  configuration  obviously  extends  to  an  N  processor  3N- 
memory  configuration  so  long  as  the  memory  unit  can  hold  an 
entire  column  of  pixel  values.  A  TRAC-like  1CN  can  realize  these 
configurations  so  long  as  these  restrictions  are  met.  It  is  also  the 
case  that  the  necessary  data  movement  can  be  realized  by 
reconfiguration  of  normal  circuits.  This  is  not  the  method  of  choice 
so  long  as  the  conditions  for  a  tree  circuit  representation  can  be 
met. 
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It  may  be  desired  to  nse  a  degree  of  parallelism  greater  than  M 
(N>M).  Then  the  columns  of  pixels  must  be  decomposed  into 
vectors  of  length  M/k  where  N=kM.  In  this  case  the 
establishment  of  circuits  is  dependent  upon  k  and  may  not  always 
be  possible.  A  formulation  using  both  circuit  switching  and  packet 
rooting  capabilities  for  TRAC  has  been  worked  out.  The  pixels 
appearing  at  the  boundaries  created  by  partitioning  of  columns 
have  their  ’nearest  neighbors*  sent  to  them  by  packet  movement. 
This  ’mixed*  communication  mechanism  is  still  of  lower  cost  than 
a  pore  packet  based  mechanism.  The  case  N<M  (for  N— 2', 
m— 2*)  is  bandied  by  assigning  multiple  (2k)  columns  to  processors. 
This  case  raises  no  new  problems. 

We  give  here  numbers  only  for  the  circuit  switching 
representation  where  N— M  and  data  movement  is  via  tree  circuit 
activation  and  deactivation.  Then  the  total  communication  cost  is 
(N/2)  LOG(N)  (assuming  deactivation  and  activation  of  all  tree 
circuit  paths  is  done  in  parallel).  If  N=M— 512,  then  only 
256*9—2304  memory  cycles  are  required  for  data  communication. 


This  ia  negligible  compared  to  the  0*512*512  arithmetic  operations 
on  the  pixels  (C>_10  snd  probably  C>!0:)  since  indexing  must  be 
accomplished  as  well  as  the  addition  and  division  of  smoothing 
itself. 

We  thus  conclude  that  for  smoothing  data  movement  costs  will 
be  essentially  trivial  for  both  packet  and  circuit  switching 
representations. 

Sr>eed-up  Analysis  and  Discussion 

Figure  7  shows  the  net  speed-up  versus  the  number  of  processors 
for  M=1024  and  X=8.  There  is,  in  this  case,  little  difference 
between  formulations  based  on  different  switching  strategies  for 
moderate  numbers  of  processors.  There  is  the  suggestion  that 
circuit  switching  will  yield  superior  pertormance  for  large  numbers 
of  processors. 

Figure  8  shows  the  speed-up  factor  as  a  function  of  X  for 
M=1024,  and  N=25G.  The  amount  of  data  transferred  grown 
exponentially  as  X.  Thus  circuit  switching  data  movement  shows  a 
strong  advantage  as  X  increases  since  the  cost  of  data  movement  in 
the  circuit  switching  strategy  given  here  is  constant  with  respect  to 
data  volume  until  the  capacity  of  a  memory  unit  is  exceeded. 

Smoothing  on  the  other  hand  shows  advantage  for  packet 
switching  since  there  are  cases  where  a  pure  circuit  switching 
formulation  becomes  rather  complex. 

The  bottom  line  with  respect  to  parallel  histogramming  is  that 
circuit  switching  has  an  advantage  resulting  from  flexibility  in  the 
unit  size  of  transfers  and  in  stability  with  respect  to  algorithm 
parameters  but  that  well-designed  architectures  should  give  similar 
performance  for  small  to  moderate  numbers  of  processors. 

Circuit  switching  and  packet  switching  are  both  extremely 
efficient  for  parallel  smoothing.  Packet  switching  has  an  advantage 
over  circuit  switching  with  respect  to  application  of  degrees  of 
parallelism  with  N>M  for  parallel  smoothing.  This  advantage 
atiaea  from  the  greater  flexibility  in  communication  topology. 
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Abstract 

This  paper  defines  one  set  of  requirements  for 
a  successful  general  purpose  parallel 
architecture,  describes  the  design  concepts  of  the 
Texas  Reconf i gurable  Array  Computer  (TRAC)  and 
then  demonstrates  that  the  TRAC  architecture 
fulfills  these  requirements.  It  will  be  seen  that 
TRAC  implements  a  general  purpose  parallel 
computation  system  through  its  ability  to 
implement  a  spectrum  of  single  purpose 
architectures.  Special  attention  is  paid  to 
architectural  support  for  software  and  to  the  I/O 
problems  for  a  many-processor  architecture. 


Requirements  for  Effective 
Parallel  Computing 

A  parallel  computer  architecture  is  founded 
upon  a  set  of  assumptions  about  the  nature  of  the 
workload  it  is  to  execute  and  the  technology 
available  for  implementation.  These  assumptions 
in  turn  define  the  requirements  for  the 
architecture.  This  section  lists  the  assumptions 
and  requirements  which  underlie  the  design  of  the 
Texas  Reconfigurable  Array  Computer  (TRAC). 


1.  An  effective  general  purpose  parallel 
architecture  must  be  able  to  realize  a 
spectrum  of  models  of  parallel 
computation.  (See  Section  2  for 
definition  of  models  of  parallel 
computation.)  In  many  cases  efficient 
parallel  formulation  of  most 
significant  problems  will  require 
multiple  modes  of  parallel  computation. 

2.  High  performance  communication  between 
parallel  processes  is  as  important  as 
high  performance  computations. 

3.  A  successful  many  processor 
architecture  must  include  an  effective 
solution  to  the  problem  of  distributing 
data  from  I/O. 

A.  Software  will  be  the  major  bottleneck 
in  the  application  of  parallel 
computing. 


concept  base  which  is  extensible  across 
a  broad  range  of  numbers  of  processors. 

The  validity  of  the  arguments  that  TRAC  can  be 
the  basis  for  a  very  high  performance  parallel 
architecture  depends  on  the  validity  of  this 

requirements  analysis. 

The  sections  of  this  paper  which  follow  will 
show  that  TRAC  offers  one  total  system 
architecture  which  meets  the  requirements  of  the 
problem  domain  defined  by  this  set  of  assumptions. 

Hodels  of  Computation 

It  is  useful,  before  going  on  to  the  discussion 
of  TRAC  design  principles,  to  give  our  perspective 
on  parallel  computing.  Problems,  algorithms, 
architectures  and  languages  can  all  be 

characterized  by  the  model  of  computation  which 
they  require  for  execution  or  which  they  realize. 

The  essential  element  in  the  understanding  of 
the  development  of  algorithm,  software  and 
applications  for  parallel  architectured  computer 
systems  is  an  understanding  of  the  models  of 
computation  required  by  significant  problems  and 
realized  on  hardware  architectures.  A  model  of 
computation  for  sequential  computing  includes  the 
following  elements: 

1.  primitive  units  of  computation 

2.  composition  rules  for  composing  the 
primitive  units  of  computation  into 
executable  and  schedulable  units 

3.  definition  of  address  spaces  which 
control  the  data  '  to  which  the 
computation  is  applied. 

Models  of  parallel  computation  add  to  the  elements 
Of  sequential  computation  requirements  for: 

1.  modes  and  topology  of  communication 
between  units  of  computation  which  are 
executing  in  parallel  and 

2.  modes  and  types  of  synchronization 
mechanisms. 


5.  A  successful  family  of  parallel 
architectures  will  be  founded  on  a 
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An  application  is  typically  made  up  of 
algorithms  and  steps  and  phases  within  algorithms, 
each  of  which  may  be  characterized  by  some  model 
of  computation.  Units  of  computation  and  the 
modes  and  topology  of  communication  are  often 
disjoint  between  phases  of  an  algorithm.  These 
algorithms  and  applications  must  be  mapped  to  a 
hardware  architecture  which  also  realizes  one  or 
more  models  of  parallel  computation  including  the 
specification  of  one  or  more  modes  of 
synchronization  and/or  topologies  and  modes  of 
communication. 

Design  Concepts 

This  section  outlines  the  design  concepts  which 
are  used  in  the  TRAC  system  to  meet  the 
requirements  posed  by  the  problem  statement 
defined  in  the  section  entitled  "Requirements  for 
Effective  Parallel  Computing."  The  set  of 
concepts  which  we  will  describe  include 
configurabil ity,  partitioning.  integrated  I/O 
architectur  e.  and  multiple  modes  of  communication. 

Configurability  is  the  ability  to  realize 
problem  specific  parallel  architectures  from  a 
collection  of  resources.  Partitioning  is  the 
separation  of  resource  configurations  which  gives 
conflict  free  execution.  An  integrated  I/O 
architecture  constructs  a  solution  for  external 
storage  as  a  part  of  the  design  of  the  memory 
architecture  of  the  system.  Multiple  mode3  of 
communication  refers  to  the  use  of  both  circuit 
switching  and  packet  switching  as  modes  of  data 
movement  through  the  components  of  the 
architecture. 

Configurability 

Configurability  in  TRAC  is  founded  on  the 
dynamic  establishment  of  circuits  in  an 
interconnection  network.  The  interconnection 
network  used  in  the  current  incarnation  of  TRAC  is 
a  two-sided  SW  Banyan  network  [GOK73,  UP80]. 
Processors  are  placed  at  the  apex  nodes  of  the 
■switch  and  memory-I/O  units  at  the  bases  of  the 
switch.  Figure  1  is  a  schematic  of  a  ^-processor 
9-*nemory-I/0  configuration  of  TRAC.  The 
architecture  of  TRAC  and  the  properties  of  the 
banyan  network  have  been  well  described  in  the 
literature  [SE  J80,  PRE80,  LIP80],  We  will  focus  on 
the  properties  which  result  from  this 
architecture. 

Three  types  of  circuits  are  implemented.  All 
have  the  shape  of  tree3.  The  trees  rooted  at  the 
apex  nodes  link  one  processor  to  many  memory-I/O 
nodes  and  realize  an  address  space  for  the 
processor.  There  are  two  types  of  trees  with 
roots  at  the  base  of  the  network.  An  instruction 
tree  couples  a  set  of  processors  to  give  an  S1MD 
architecture.  A  switchable  memory  tree 
establishes  a  family  of  circuits  potentially 
linking  many  processors  to  a  single  mcmory-I/0 
unit.  Only  one  circuit  in  this  family  may  be 
active  at  any  ,?iven  time.  Figure  2  illustrates 
the  formation  oi  such  circuits  for  the  ^-processor 
9-«eraory-I/0  configuration.  It  is  straightforward 


to  see  that  this  capability  for  configuring 
architectures  from  resources  through  the 
establishment  of  circuits  in  the  interconnection 
network  supports  realization  of  almost  arbitrary 
parallel  architectures. 

Partitioning 

Partitioning  is  again  founded  on  the 
establishment  of  circuits  in  the  network. 
Partitioning  follows  from  the  fact  that  a  given 
resource  element  interacts  on  a  cycle  by  cycle 
basis  only  with  the  resource  elements  to  which  it 
is  joined  by  active  circuits.  The  construction  of 
a  configuration  from  a  collection  of  resources  can 
thus  be  said  to  partition  these  resources. 

Integrated  I /0  Architectures 

The  memory  unit  at  each  base  node  of  the 
network  may  also  have  attached  a  self-managing 
secondary  memory  unit  (SKSK)  [RAT81.RATB3].  The 
SKSM  is  an  object-oriented  storage  device.  It 
implements  associative  searching  by  name  and  thus 
implements  a  local  directory  for  the  objects  it 
has  stored.  The  storage  element  for  a  SI1SM  may 
range  from  additional  RAM  to  a  large  disk.  An 
SHSM  extends  the  address  space  of  a  memory  unit  to 
include  the  objects  stored  in  the  SMSM. 

Multi  pie  Modes  of  Common) eat) on 

Efficient  parallel  formulations  of  problems 
and/or  algorithms  must  minimize  communication 
between  streams  of  computation  which  are 
proceeding  in  parallel.  Parallel  architectures 
are,  however,  only  of  interest  when  there  must  be 
some  cooperation  and  thus  communication  between 
the  parallel  streams  of  computation.  The  ideal 
communication  system  will  have  zero  latency, 
infinite  bandwidth  and  realize  arbitrary 
topologies  of  communication.  These  are  the 
properties  of  a  paracomputer  [SCH80j.  We  argue 
that  realization  of  these  properties  in  an 
interconnection  network  based  architecture  can  be 
best  approached  through  the  use  of  both  circuit 
and  packet  modes  of  communication.  The  TRAC 
interconnection  network  does  realize  both  modes  of 
data  movement.  Circuit  based  movement  of  data  is 
based  upon  selective  activation  of  circuits  in  the 
switchable  trees.  If  one  processor  deactivates  a 
circuit  in  a  switchable  tree  and  another  processor 
activates  a  circuit  in  that  tree  then  the  entire 
contents  of  that  memory  are  moved  between  the 
address  spaces  of  the  two  processors. 

TRAC  also  implements  a  complete  packet 
switching  capability.  A  processor  may  direct  any 
memory  to  which  it  has  an  active  circuit  to  send  a 
data  packet  selected  from  its  contents  to  any 
other  processor.  In  fact,  all  processors  may 
simultaneously  transmit  packets. 

Effectiveness  of  TRAC  as  a 
Parallel  Architecture 

This  section  demonstrates  how  an  architecture 
built  on  the  design  principles  described  preceding 
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realizes  the  requirements  defined  in  the  first 
section  of  the  paper. 

Realization  of  Multiple  Models  of  Computation 

It  is  obvious  that  TRAC  can  implement  multiple 
models  of  parallel  computation.  There  might, 
however,  be  some  need  to  discuss  the  efficient 
implementation  of  actual  computations.  We  will 
discuss  implementation  of  communication  in  a 
separate  sub-section.  The  TRAC  architecture  can 
incorporate  almost  any  type  or  class  of  processor. 
The  principal  question  to  be  asked  is  the 
efficiency  of  access  to  memory  by  an  individual 
processor.  Analyses  of  the  switch  node  have  shown 
that  it  is  readily  possible  to  construct  nodes 
with  delays  of  the  order  of  a  few  nanoseconds. 
The  length  of  the  path  between  processors  and 
memories  is  bounded  by  log  n  where  n  is  the  number 
of  processors  so  that  total  delay  will  be  not  more 
than  10's  of  nanoseconds  for  even  quite  large 
numbers  of  processors.  This  is  still  much  less 
than  cycle  time  for  memory  cells  appropriate  for 
processors  of  considerable  speed.  It  is  also 
readily  possible  to  include  processor  local  caches 
in  the  TRAC  design  although  we  have  not  done  so. 

There  are  two  other  issues  to  be  addressed. 
These  are  the  costs  of  configuration  and 
reconfiguration  and  the  problems  created  by 
blocking  in  the  network  causing  effective  loss  of 
resources.  The  TRAC  architecture  provides 
hardware  •  support  for  the  formation  of 
configurations  [RAT8G, JEN81 , JEN 82].  Actual 
establishment  of  configurations  can  be 
accomplished  in  microseconds.  The  problems  of 
allocation  of  resources  to  form  configurations  has 
been  extensively  studied  by  modeling  ad  simulation 
[DEG 81  a , PRE81 ].  These  studies  have  consistently 
shown  that  rather  simple  algorithms  are  apparently 
adequate  for  reasonable  levels  of  resource  usage 
and  rates  of  change  of  configurations. 

Communication  and  Data  Movement 

The  combination  of  circuit  switching  and  packet 
switching  mechanisms  for  communication  implemented 
by  TRAC  synergistically  combine  to  give  very  high 
performance  over  a  spectrum  of  requirements  for 
data  movement.  The  switchable  trees  give  a 
mechanism  for  very  high  bandwidth  communication. 
The  deactivation  of  one  circuit  and  the  activation 
of  another  in  a  switchable  tree  can  be 
accomplished  in  a  few  memory  cycle  times.  The 
result  is  the  transfer  between  process  address 
space  of  the  entire  contents  of  a  memory-1/0  cell 
which  may  include  megabytes  of  direct  access 
memory  as  well  as  executable  memory. 

Packets  implement  a  mechanism  of  arbitrary 
topology  but  lower  bandwidth.  Packet  based 
communication  can  be  used  where  the  establishment 
of  circuits  has  an  unacceptably  high  resource  cost 
or  is  simply  not  possible. 

Studies  of  image  processing  [YAS83]  algorithms 
!have  suggested  the  power  of  combining  the’  two 


modes  of  communication  for  interconnection  network 
based  architectures. 

The  1/0  Probi  ero 

A  many  processor  parallel  architectu-e  must 
face  the  problems  of  distributing  data  to 
processors,  maintaining  a  directory  structure, 
maintaining  consistency  and  developing  the 
necessary  bandwidth  of  1/0.  TiiAC  provides  the 
capability  of  placing  a  self-managing  secondary 
memory  (SM3M)  on  every  memory  none.  ,  The  overhead 
of  maintaining  a  central  directory  is  not  required 
since  the  SKSK's  implement  local  directory 
services  which  can  be  combined  to  give  a 
distributed  directory.  Parallel  directory 
searches  can  be  executed  by  having  each  SMSM 
initiate  a  search  on  its  own  portion  of  the 
distributed  directory.  The  availability  of  an  I/O 
port  at  each  memory  site  gives  an  enormous 
opportunity  for  parallel  structuring  of  I/O 
processing.  Thus  TRAC  attains  1/0  bandwidth 
through  parallelism.  The  external  data  necessary 
for  execution  of  a  given  process  can  be  stored  on 
the  SMSM's  local  to  that  process.  Thus  complex 
data  movements  to  complete  I/O  are  avoided. 

Support  for  Software 

Resource  management  in  an  environment  of 
thousands  of  processors,  memories  and  I/O  devices 
is  an  unsolved  problem.  The  overhead  of  resource 
.management  in  resource  sharing  systems  tends  to 
rise  in  a  greater  than  linear  fashion  with  the 
degree  of  competition  for  the  resources  and  the 
mediation  of  conflict  for  resources.  TRAC  avoids 
this  problem  since  sharing  is  all  explicit.  Each 
configuration  of  resources  organized  into  a  given 
computer  architecture  executes  independently  of 
all  other  configurations  because  the  establishment 
of  circuits  partitions  the  resources.  The 
processes  interact  only  when  interactions  are 
programmed  as  a  part  of  the  computation.  DeGroot 
[DEGSIb]  has  discussed  resource  management  for 
TRAC  using  the  hardware  support  provided. 

The  structure  of  the  operating  system  for  TRAC 
has  been  described  elsewhere  [BR082].  Figure  3  is 
a  schematic  of  the  operating  system  for  TRAC. 

The  structure  of  this  operating  system  is 
hierarchical  with  the  functionality  partitioned  on 
a  job-by-job  basis.  This  structure  in  effect 
creates  a  local  operating  system  for  each  job. 
The  local  operating  systems  interact  only  when  job 
configurations  are  established  and  altered.  This 
proposed  implementation  has  a  growth  in  resource 
management  overhead  which  is  linear  in  the  number 
of  jobs  being  executed  on  the  entire  architecture. 

Extensibility 

The  concept  base  for  TRAC  can  be  effective  for 
a  spectrum  of  configurations  of  processors  from  a 
small  to  moderate  number  of  relatively  fast 
processors  to  a  large  number  of  slower  processors. 
Estimates  made  of  loss  of  effectiveness  tin  ougn 


conflict,  and  management  overhead  [JEN82]  suggest 
that  a  configuration  of  a  thousand  processors  each 
in  the  class  of  a  few  MIPS  should  be  efficiently 
realizable  in  both  hardware  and  software. 

The  memory  and  I/O  system  is  modular.  Large 
memories  are  realized  by  combining  smaller  units 
through  circuit  establishment.  Distribution  of 
1/0  capacity  with  memory  units  prevents  the 
development  of  a  data  movement  problem  for  large 
numbers  of  processors. 

The  extensibility  of  TRAC  is  limited  by  the 
cost  growth  of  n  log  n  for  the  switch  nodes  and  by 
the  density  of  wires  in  this  switci  structure,  A 
one-sided  banyan  [GOK79]  has  only  linear  growth  in 
switch  node3  as  n  becomes  largo  and  may  allow  for 
still  larger  configurations  founded  on  the  same 
principles. 


The  discussion  given  preceding  argues  that  the 
design  concepts  of  TRAC  offer  a  cost  effective 
basis  for  realizing  very  high  performance  parallel 
architectures.  The  practicality  of  these  concepts 
has  been  demonstrated  in  the  TRAC  prototype  now 
running  in  Austin  and  through  numerous  modeling 
and  simulation  studies. 
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